IRS Unit-1

Uploaded by

Venkatesh J
Copyright
© © All Rights Reserved
1.1 Introduction to Information Retrieval Systems

Definition of Information Retrieval System

Information Retrieval System:


 A system that is capable of storage, retrieval, and maintenance of information.

 Information can be composed of text (including numeric and date data), images,
audio, video, and other multi-media objects.

 It consists of a software program that helps a user find the information
that the user needs.

 It may use standard computer hardware or specialized hardware to support the
search sub-function and to convert non-textual sources to a searchable medium
(e.g., transcription of audio to text).

 Item:

 the smallest complete unit that is processed and manipulated by the system is
called an item.

 The definition of an item varies by how a specific source treats information.

 A video news program treated as an item, for example, is composed of text in the
form of closed captioning, audio text provided by the speakers, and the video
images being displayed.

 Example:

 A book
 Newspaper
 magazine
 video news program

 With the advent of inexpensive, powerful personal computer processing systems
and high-speed, large-capacity secondary storage products, it has become
commercially feasible to provide large textual information databases for the
average user.

 The introduction and exponential growth of the Internet, along with its initial
WAIS (Wide Area Information Servers) capability and more recent advanced
search servers (e.g., INFOSEEK, EXCITE), have provided a new avenue for
access to terabytes of information.

 The algorithms and techniques to optimize the processing and access of large
quantities of textual data were once the sole domain of segments of the
Government, a few industries, and academics.

 Images across the Internet are searchable from many websites such as
WEBSEEK, DITTO.COM, and ALTAVISTA/IMAGES.

 News organizations such as the BBC are processing the audio news they have
produced and are making historical audio news searchable via the audio-
transcribed versions of the news.

 Major video organizations such as Disney are using video indexing to assist in
finding specific images in their previously produced videos to use in future
videos or incorporate in advertising.

 There is a potential for confusion in the understanding of the differences between
Database Management Systems (DBMS) and Information Retrieval Systems.

 It is easy to confuse the software that optimizes functional support of each type
of system with actual information or structured data that is being stored and
manipulated.

 The importance of the differences lies in the inability of a database management
system to provide the functions needed to process “information.”

 The opposite, an information system containing structured data, also suffers
major functional deficiencies.

1.2 Objectives of Information Retrieval Systems

 Search composition, search execution, and reading non-relevant items are all
aspects of information retrieval overhead.

 The general objective of an Information Retrieval System is to minimize the
overhead of a user locating needed information.

 Overhead can be expressed as the time a user spends in all of the steps leading to
reading an item containing the needed information.

Example:

 query generation,
 query execution,
 scanning results of query to select items to read,
 reading non-relevant items.

 In information retrieval, the term “relevant” item is used to represent an item
containing the needed information.

 When a user decides to issue a search looking for information on a topic, the
total database is logically divided into four segments, formed by two properties
of each item.

 Relevant items

 Documents that contain information that helps the searcher answer the
question.

 Non-relevant items

 Items that do not provide any directly useful information.

 Two possibilities with respect to each item:

 it can be retrieved by the user’s query

 it can be not retrieved by the user’s query.

 The four segments are therefore: relevant retrieved, relevant not retrieved,
non-relevant retrieved, and non-relevant not retrieved.

 The two major measures commonly associated with information systems are

 Precision
 Recall

 Precision = Number_Retrieved_Relevant / Number_Total_Retrieved

 Precision is directly affected by retrieval of non-relevant items: as more
non-relevant items are retrieved, it drops toward zero.

 Recall = Number_Retrieved_Relevant / Number_Possible_Relevant

 Recall is not affected by retrieval of non-relevant items

 and thus remains at 100 percent once achieved.
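
As a minimal sketch of the two measures, the function below computes precision and recall from sets of item identifiers. The function name and the sample item numbers are illustrative assumptions, not part of the notes.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given collections of item ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    retrieved_relevant = len(retrieved & relevant)
    # Guard against empty sets so the ratios are always defined.
    precision = retrieved_relevant / len(retrieved) if retrieved else 0.0
    recall = retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical database in which items 1-4 are the relevant ones.
relevant = {1, 2, 3, 4}
print(precision_recall({1, 2, 3, 4}, relevant))              # (1.0, 1.0)
print(precision_recall({1, 2, 3, 4, 5, 6, 7, 8}, relevant))  # (0.5, 1.0)
```

The second call retrieves four extra non-relevant items: recall stays at 100 percent while precision falls.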

 Natural language queries

 are becoming a standard system feature.

 allow users to state in natural language what they are interested in
finding.

 Most users on the Internet enter only one or two search terms.

 The completeness of the user specification is therefore limited by the user’s
willingness to construct long natural language queries.

 Examples of Information Retrieval Systems that accept such queries are
RetrievalWare, TOPIC, AltaVista, Infoseek, and INQUERY.

1.3 Functional Overview

IRS is composed of four major functional processes:


 Item Normalization
 Selective Dissemination of Information (i.e., “Mail”)
 Archival Document Database Search
 Index Database Search along with Automatic File Build process.

1) Item Normalization:

 The first step in any integrated system is to normalize the incoming items to
a standard format.
 provides logical restructuring of the item.
 Additional operations during item normalization that are needed to create a
searchable data structure are:
 identification of processing tokens (e.g., words)
 characterization of the tokens
 stemming (e.g., removing word endings) of the tokens

 The processing tokens and their characterization are used to define the
searchable text from the total received text.
 Figure 1.5 shows the normalization process.

 Standardizing
 A system may have a single format for all items or allow multiple
formats.
 Takes the different external formats of input data and performs the
translation to the formats acceptable to the system.

 Example of standardization
 Translation of foreign languages into Unicode.
 Every language has a different internal binary encoding for the
characters in the language.
 One standard encoding that covers English, French, Spanish, etc. is
ISO-Latin.

 Multi-media adds an extra dimension to the normalization process.


 In addition to normalizing the textual input, the multi-media input also needs
to be standardized.
 Many standards can be applied during normalization.
 If the input is video, the likely digital standards will be MPEG-2,
MPEG-1, AVI or Real Media.
 MPEG (Moving Picture Experts Group) standards are the most universal
standards for higher quality video.
 Real Media is the most common standard for lower quality video being used
on the Internet.
 Audio standards are typically WAV or Real Media (Real Audio). Images
vary from JPEG to BMP.

 Logical sub-setting or Zoning


 parse the item into logical sub-divisions that have meaning to the user
 visible to the user and used to increase the precision of a search and
optimize the display.
 A typical item is sub- divided into zones, which may overlap and can
be hierarchical, such as Title, Author, Abstract, Main Text,
Conclusion, and References.

 token identification operation


 The zoning information is passed to the processing token
identification operation to store the information, allowing searches to
be restricted to a specific zone.
 For example, if the user is interested in articles discussing “Einstein”
then the search should not include the Bibliography, which could
include references to articles written by “Einstein.”

 Systems determine words by dividing input symbols into 3 classes:
 Valid word symbols
 Inter-word symbols
 Special processing symbols.

 word
 is defined as a contiguous set of word symbols bounded by
inter-word symbols.
 Examples of word symbols are alphabetic characters and
numbers.
 inter-word symbols
 are non-searchable and should be carefully selected.
 Examples of possible inter-word symbols are blanks, periods
and semicolons.
 The exact definition of an inter-word symbol is dependent
upon the aspects of the language domain of the items to be
processed by the system.
 For example, an apostrophe may be of little importance if only
used for the possessive case in English, but may be critical for
representing foreign names.
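
The three symbol classes can be sketched as a small tokenizer. This is an illustration under stated assumptions: ASCII letters and digits are the word symbols, the apostrophe is the only special processing symbol (dropped without splitting the word), and every other character is an inter-word symbol. A real system would choose these classes per language domain.

```python
import string

WORD_SYMBOLS = set(string.ascii_letters + string.digits)
SPECIAL_SYMBOLS = {"'"}  # assumption: apostrophes are removed, not treated as breaks

def processing_tokens(text):
    """Split text into words bounded by inter-word symbols."""
    tokens, current = [], []
    for ch in text:
        if ch in WORD_SYMBOLS:
            current.append(ch)
        elif ch in SPECIAL_SYMBOLS:
            continue  # special processing: skip the symbol, keep building the word
        else:
            # Any other character is an inter-word symbol and ends the word.
            if current:
                tokens.append("".join(current))
                current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(processing_tokens("Einstein's theory; published 1905."))
# ['Einsteins', 'theory', 'published', '1905']
```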

 Stop List/Algorithm
 applied to the list of potential processing tokens.
 The objective of the Stop function is to save system resources
by eliminating from the set of searchable processing tokens
those that have little value to the system.
 Given the significant increase in available cheap memory,
storage and processing power, the need to apply the Stop
function to processing tokens is decreasing.
 Examples of Stop algorithms are:
 Stop all numbers greater than “999999”,
 Stop any processing token that has numbers and
characters intermixed.
 The algorithms are typically source specific, usually eliminating
unique item numbers that are frequently found in systems and have
no search value.
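
The two Stop algorithms listed above, combined with a small stop list, might look like the sketch below. The stop words chosen and the function name are assumptions for illustration only.

```python
STOP_WORDS = frozenset({"the", "of", "and", "a", "to"})  # illustrative stop list

def is_stopped(token):
    """Return True if the token should be dropped from the searchable tokens."""
    if token.lower() in STOP_WORDS:
        return True
    if token.isdecimal():
        # Stop algorithm 1: stop all numbers greater than 999999.
        return int(token) > 999999
    # Stop algorithm 2: stop tokens with numbers and characters intermixed.
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    return has_digit and has_alpha

tokens = ["the", "budget", "of", "1998", "12345678", "AB12X9", "report"]
print([t for t in tokens if not is_stopped(t)])
# ['budget', '1998', 'report']
```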

 Characterize tokens
 The next step in finalizing processing tokens is the identification of any
specific word characteristics.
 The characteristic is used in systems to assist in disambiguation of a
particular word.
 Thus, for a word such as “plane,” the system understands that it could mean
“level or flat” as an adjective, “aircraft or facet” as a noun, or “the act of
smoothing or evening” as a verb.
 Another example of characterization is if upper case should be preserved.
 In most systems upper/lower case is not preserved to avoid the system
having to expand a term to cover the case where it is the first word in a
sentence.
 But, for proper names, acronyms and organizations, the upper case
represents a completely different use of the processing token versus it being
found in the text.
 “Pleasant Grant” should be recognized as a person’s name versus a
“pleasant grant” that provides funding.
 Other characterizations that are typically treated separately from text are
numbers and dates.

 Applying Stemming
 Once the potential processing token has been identified and characterized,
most systems apply stemming algorithms to normalize the token to a
standard semantic representation.
 The decision to perform stemming is a tradeoff between precision of a
search (i.e., finding exactly what the query specifies) versus standardization
to reduce system overhead in expanding a search term to similar token
representations with a potential increase in recall.
 For example, the system must keep singular, plural, past tense, possessive,
etc. as separate searchable tokens and potentially expand a term at search
time to all its possible representations, or just keep the stem of the word,
eliminating endings.
 The amount of stemming that is applied can lead to retrieval of many non-
relevant items.
 Some systems, such as RetrievalWare, that use a large dictionary/thesaurus
look up words in the existing dictionary to determine the stemmed version
in lieu of applying a sophisticated algorithm.
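
To make the tradeoff concrete, here is a deliberately naive suffix-stripping sketch. It is not one of the stemming algorithms used by the systems named above; the suffix list and the minimum stem length are assumptions chosen to show how aggressive stemming merges unrelated words.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative; checked in this order

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

print(stem("computers"))                   # computer
print(stem("computing"), stem("computed")) # comput comput
print(stem("news"))                        # new
```

The last call shows the precision risk: “news” and “new” collapse to the same stem, so a search on one retrieves items about the other.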

 Searchable data structure


 Once the processing tokens have been finalized, based upon the stemming
algorithm, they are used as updates to the searchable data structure.
 The searchable data structure is the internal representation (i.e., not visible
to the user) of items that the user query searches.
 This structure contains the semantic concepts that represent the items in the
database and limits what a user can find as a result of their search.
 When the text is associated with video or audio multi-media, the relative
time from the start of the item for each occurrence of the processing token
is needed to provide the correlation between the text and the multi-media
source.

2) Selective Dissemination of Information

 provides the capability to dynamically compare newly received items in the
information system against standing statements of interest of users
 and to deliver each item to those users whose statement of interest matches the
contents of the item.
 is composed of the search process, user statements of interest (Profiles) and
user mail files.
 As each item is received, it is processed against every user’s profile.
 A profile contains a typically broad search statement along with a list of
user mail files that will receive the document if the search statement in the
profile is satisfied.
 Selective Dissemination of Information has not yet been applied to
multimedia sources.
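
The dissemination flow can be sketched as follows. The profile layout, the user names, the mail file names and the rule that any single profile term satisfies the search statement are all assumptions made for illustration.

```python
# Hypothetical profiles: a broad search statement plus the mail files to fill.
profiles = [
    {"user": "alice", "terms": {"oil", "opec"}, "mail_files": ["alice/energy"]},
    {"user": "bob", "terms": {"election"}, "mail_files": ["bob/politics", "bob/all"]},
]

def disseminate(item_text, profiles):
    """Process one newly received item against every profile."""
    tokens = set(item_text.lower().split())
    deliveries = []
    for profile in profiles:
        # Assumption: the profile is satisfied if any of its terms occurs in the item.
        if profile["terms"] & tokens:
            deliveries.extend(profile["mail_files"])
    return deliveries

print(disseminate("OPEC raises oil output", profiles))
# ['alice/energy']
```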

3) Document Database Search


 provides the capability for a query to search against all items received by
the system.

 composed of the search process, user-entered queries (typically ad hoc
queries) and the document database which contains all items that have been
received, processed and stored by the system.

 Typically items in the Document Database do not change (i.e., are not
edited) once received.

4) Index Database Search


 user can logically store an item in a file along with additional index terms and
descriptive text the user wants to associate with the item for future reference.

 provides the capability to create indexes and search them.


 Index files are of two classes:

1) Public Index files
2) Private Index files

Private Index files
 Every user can have one or more private Index files leading to a very large
number of files.
 Each Private Index file references only a small subset of the total number of
items in the Document Database.
 Private Index files typically have very limited access lists.

Public Index files
 maintained by professional library services personnel
 typically index every item in the Document Database.
 There is a small number of Public Index files.
 These files have access lists (i.e., lists of users and their privileges) that
allow anyone to search or retrieve data.
 To assist the users in generating indexes, especially the professional
indexers, the system provides a process called Automatic File Build
(also called Information Extraction).

1.4 Relationship to Database Management Systems, Digital Libraries, Data
Warehouse

Relationship to Database Management Systems

DBMS
 Structured data consists of well-defined facts represented by tables.
 The user gets exactly the desired information for a specific request.

IRS
 There is a high probability of not finding all the relevant items.
 Items use differing vocabulary and may discuss one or many topics.
 Hierarchical search and ranking are used.

 From a practical standpoint, the integration of DBMSs and Information
Retrieval Systems is very important.
 Commercial database companies have already integrated the two types of
systems.
 One of the first commercial databases to integrate the two systems into a single
view is the INQUIRE DBMS. This has been available for over fifteen years.
 A more current example is the ORACLE DBMS that now offers an imbedded
capability called CONVECTIS, which is an information retrieval system that
uses a comprehensive thesaurus.
 The INFORMIX DBMS has the ability to link to RetrievalWare to provide
integration of structured data and information along with functions associated
with Information Retrieval Systems.

Digital Libraries and Data Warehouses (DataMarts)


 As the Internet continued its exponential growth and project funding became
available, the topic of Digital Libraries has grown.
 By 1995 enough research and pilot efforts had started to support the 1st ACM
International Conference on Digital Libraries.
 Indexing is one of the critical disciplines in library science and significant effort
has gone into the establishment of indexing and cataloging standards.
 Migration of many of the library products to a digital format introduces both
opportunities and challenges.
 Information Storage and Retrieval technology has addressed a small subset of the
issues associated with Digital Libraries.

 Data warehouses are similar to information storage and retrieval systems in that
they both have a need for search and retrieval of information.

 But a data warehouse is more focused on structured data and decision support
technologies.
 In addition to the normal search process, a complete system provides a flexible
set of analytical tools to “mine” the data.
 Data mining (originally called Knowledge Discovery in Databases - KDD) is a
search process that automatically analyzes data and extracts relationships and
dependencies that were not part of the database design.

Information Retrieval System Capabilities

 Search Capabilities
 Objective, Weighting, Functions, Relationships, Interpretations

 Browse Capabilities
 Miscellaneous Capabilities

1.5 Search Capabilities


 Objective:
o To allow for a mapping between a user’s specified need and the items in
the information database that will answer that need.
 Search query statement
o consists of natural language text in composition style
and/or query terms with Boolean logic indicators between them.
 Weight (of search terms)
o allow a user to indicate the importance of search terms in either a Boolean
or natural language interface
o Weight holds significant potential for assisting in the location and ranking
of relevant items.
o Given a natural language query statement, the importance of a particular
search term is indicated by a value in parentheses between 0.0 and 1.0, with
1.0 being the most important.

 search statement may apply to the complete item or contain additional parameters
limiting it to a logical division of the item (i.e., to a zone).
 Based upon the algorithms used in a system, many different functions are
associated with the system’s understanding of the search statement.
 Functions
o define the relationships between the terms in the search statement and the
interpretation of a particular word.

 Examples of the relationships are
o Boolean
o Natural Language
o Proximity
o Contiguous Word Phrases
o Fuzzy Searches

 Examples of the interpretations are
o Term Masking
o Numeric and Date Range
o Contiguous Word Phrases
o Concept/Thesaurus expansion

Boolean Logic
 allows a user to logically relate multiple concepts together to define what
information is needed.
 Boolean functions apply to processing tokens identified anywhere within an item.
 Typical Boolean operators are AND, OR, and NOT.
 These operations are implemented using set intersection, set union and set
difference procedures.
 A few systems introduced the concept of “exclusive or,” but it is equivalent to a
slightly more complex query using the other operators and is not generally useful to users.
 A special type of Boolean search is called “M of N” logic.
 The user lists a set of possible search terms and accepts any item that contains a
subset of the terms of the required size.
 For example, “Find any item containing any two of the following terms: ‘AA,’
‘BB,’ ‘CC.’”
 This can be expanded into a Boolean search that performs an AND between all
combinations of two terms and ORs the results together: ((AA AND BB) OR
(AA AND CC) OR (BB AND CC)).
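
The set-based implementation of the Boolean operators and the “M of N” expansion can be sketched as below. The small inverted index (term mapped to the set of item ids containing it) is hypothetical.

```python
from itertools import combinations

# Hypothetical inverted index: term -> set of item ids that contain it.
index = {
    "AA": {1, 2, 3},
    "BB": {2, 3, 4},
    "CC": {3, 5},
}

and_result = index["AA"] & index["BB"]   # AND -> set intersection
or_result = index["AA"] | index["BB"]    # OR  -> set union
not_result = index["AA"] - index["CC"]   # AA NOT CC -> set difference

def m_of_n(terms, m):
    """AND every combination of m terms, then OR the partial results together."""
    hits = set()
    for combo in combinations(terms, m):
        hits |= set.intersection(*(index[t] for t in combo))
    return hits

print(and_result, or_result, not_result)  # {2, 3} {1, 2, 3, 4} {1, 2}
print(m_of_n(["AA", "BB", "CC"], 2))      # {2, 3}
```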

Proximity
 Proximity is used to restrict the distance allowed within an item between two
search terms.
 The semantic concept is that the closer two terms are found in a text the more
likely they are related in the description of a particular concept.
 Proximity is used to increase the precision of a search.
 The typical format for proximity is:
 TERM1 within “m” “units” of TERM2
 The distance operator “m” is an integer number and units are in
Characters, Words, Sentences, or Paragraphs.

 A special case of the Proximity operator is the Adjacent (ADJ) operator that
normally has a distance operator of one and a forward only direction.
 Another special case is where the distance is set to zero meaning within the
same semantic unit.
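
A minimal sketch of the proximity and ADJ operators, assuming the unit is always Words and that word offsets come from a simple whitespace split. The function names and sample sentences are illustrative; the zero-distance case (same sentence or paragraph) is not covered here.

```python
def word_positions(text):
    """Map each lower-cased word to the list of word offsets where it occurs."""
    positions = {}
    for offset, word in enumerate(text.lower().split()):
        positions.setdefault(word, []).append(offset)
    return positions

def within(text, term1, term2, m, forward_only=False):
    """True if term1 occurs within m words of term2."""
    positions = word_positions(text)
    for p1 in positions.get(term1, []):
        for p2 in positions.get(term2, []):
            gap = p2 - p1
            if forward_only and 0 < gap <= m:
                return True
            if not forward_only and 0 < abs(gap) <= m:
                return True
    return False

def adj(text, term1, term2):
    """ADJ: distance of one word, forward direction only."""
    return within(text, term1, term2, 1, forward_only=True)

print(within("a blind venetian merchant", "venetian", "blind", 1))  # True
print(adj("a blind venetian merchant", "venetian", "blind"))        # False
print(adj("the venetian blind was shut", "venetian", "blind"))      # True
```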

Contiguous Word Phrases (CWP)


 both a way of specifying a query term and a special search operator.
 A Contiguous Word Phrase is two or more words that are treated as a single semantic
unit.
 Example
 “United States of America”: the four words specify a search term
representing a single specific semantic concept (a country).
 “manufacturing” AND “United States of America” returns any item that
contains the word “manufacturing” and the contiguous words “United States
of America.”

 CWP acts like a special search operator that is similar to the proximity (Adjacency)
operator but allows for additional specificity.
 If only two terms are specified, the contiguous word phrase is identical to the proximity
operator with a directional one-word parameter, i.e., to the Adjacent operator.
 Nested Adjacencies
 For contiguous word phrases of more than two terms, the only way of
creating an equivalent search statement using proximity and Boolean
operators is via nested Adjacencies.
 Proximity and Boolean operators are binary operators,
 but contiguous word phrases are an “N”ary operator where “N” is the number of words
in the CWP.

 Other names of CWP


 Contiguous Word Phrases are called Literal Strings in WAIS and
Exact Phrases in RetrievalWare.
 In WAIS multiple Adjacency (ADJ) operators are used to define a Literal
String (e.g., “United” ADJ “States” ADJ “of” ADJ “America”).
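
The chain of adjacency checks can be sketched as a filter that starts from every occurrence of the first word and applies one further ADJ-style check per remaining word. The function name and the sample item are assumptions for illustration.

```python
def contiguous_phrase(tokens, phrase):
    """Return the start offsets where the words of `phrase` occur one after another."""
    # Candidates: every position of the first word of the phrase.
    starts = [i for i, tok in enumerate(tokens) if tok == phrase[0]]
    # Each further word acts like one more ADJ applied to the surviving candidates.
    for k, word in enumerate(phrase[1:], start=1):
        starts = [s for s in starts if s + k < len(tokens) and tokens[s + k] == word]
    return starts

item = "manufacturing in the united states of america and the united kingdom".split()
hits = contiguous_phrase(item, "united states of america".split())

print(hits)                                    # [3]
print("manufacturing" in item and bool(hits))  # True
```

The last line corresponds to the query “manufacturing” AND “United States of America”.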

Fuzzy Searches
 Fuzzy Searches provide the capability to locate spellings of words that are similar
to the entered search term.
 used to compensate for errors in spelling of words.
 increases recall at the expense of decreasing precision (i.e., it can erroneously
identify terms as the search term).
 In the process of expanding a query term, fuzzy searching includes other terms
that have similar spellings, giving more weight (in systems that rank output) to
words in the database that have similar word lengths and position of the characters
as the entered term.
 A Fuzzy Search on the term “computer” would automatically include the
following words from the information database: “computer,” “compiter,”
“conputer,” “computter,” “compute.”
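
A hedged sketch of fuzzy expansion using Python's standard `difflib.get_close_matches`. Its similarity ratio is a stand-in for whatever spelling-similarity measure a given retrieval system uses; the vocabulary and the 0.8 cutoff are assumptions.

```python
from difflib import get_close_matches

# Hypothetical dictionary of processing tokens found in the database.
vocabulary = ["computer", "compiter", "conputer", "computter",
              "compute", "commuter", "banana"]

def fuzzy_expand(term, cutoff=0.8):
    """Return vocabulary words spelled similarly to `term`, best match first."""
    return get_close_matches(term, vocabulary, n=len(vocabulary), cutoff=cutoff)

expanded = fuzzy_expand("computer")
print(expanded)
```

With these settings the expansion also picks up “commuter,” a correctly spelled but unrelated word, which illustrates the loss of precision noted above.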

Examples for Interpretations


Term Masking
 ability to expand a query term by masking a portion of the term and accepting as
valid any processing token that maps to the unmasked portion of the term.
 The value of term masking is much higher in systems that do not perform stemming
or only provide a very simple stemming algorithm.
 There are two types of search term masking:
o fixed length
o variable length.
 Fixed length masking
o is a single position mask.
o masks out any symbol in a particular position or the lack of that position in a
word.
 Variable length “don’t cares”
o allows masking of any number of characters within a processing token.

 The masking may be
o in the front
o at the end
o at both front and end
o imbedded.
 The first three of these cases are called suffix search, prefix search and imbedded
character string search, respectively.
 An imbedded variable length don’t care is seldom used.
 If “*” represents a variable length don’t care then the following are examples of its
use:
“*COMPUTER” Suffix Search
“COMPUTER*” Prefix Search
“*COMPUTER*” Imbedded String Search
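
The three searches can be tried with Python's standard `fnmatch` module, where “*” is a variable length don’t care. The vocabulary is hypothetical. Note that fnmatch's “?” always consumes exactly one character, so it only approximates the fixed length mask described earlier, which may also match the absence of that position.

```python
from fnmatch import fnmatchcase

# Hypothetical processing tokens from the database.
vocabulary = ["COMPUTER", "COMPUTERS", "MINICOMPUTER", "MICROCOMPUTERS", "COMPUTE"]

def mask_search(pattern):
    """Return every token that matches the masked term."""
    return [t for t in vocabulary if fnmatchcase(t, pattern)]

print(mask_search("*COMPUTER"))    # suffix search
print(mask_search("COMPUTER*"))    # prefix search
print(mask_search("*COMPUTER*"))   # imbedded string search
```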

Numeric and Date Ranges


 Term masking is useful when applied to words, but does not work for finding
ranges of numbers or numeric dates.
 To find numbers larger than “125,” using a term “125*” will not find any number
except those that begin with the digits “125.”
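
The difference between masking and a true numeric range can be seen in a few lines. The token list is hypothetical, and the range test assumes tokens have already been characterized as numbers.

```python
tokens = ["125", "1250", "126", "99", "3000", "125th"]

# Term mask "125*": a prefix match on the characters, not a numeric comparison.
prefix_hits = [t for t in tokens if t.startswith("125")]

# Numeric range "greater than 125": compare values of tokens that are numbers.
range_hits = [t for t in tokens if t.isdecimal() and int(t) > 125]

print(prefix_hits)  # ['125', '1250', '125th']
print(range_hits)   # ['1250', '126', '3000']
```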

Concept/Thesaurus Expansion
 Associated with both Boolean and Natural Language Queries
 ability to expand the search terms via Thesaurus or Concept Class database
reference tool.
 thesaurus is typically a one-level or two-level expansion of a term to other terms
that are similar in meaning.
 A Concept Class is a tree structure that expands each meaning of a word into
potential concepts that are related to the initial term (e.g., in the TOPIC system).
 Concept classes are sometimes implemented as a network structure that links word
stems (e.g., in the RetrievalWare system).
 Examples of Thesaurus and Concept Class structures are shown in Figure 2.4
and Figure 2.5.

 Thesauri are either semantic or based upon statistics.
 Semantic thesaurus
 is a listing of words and then other words that are semantically similar.
 The problem with semantic thesauri is that they are generic to a language.
 They can introduce many search terms that are not found in the document
database.

 Statistics based Thesauri


 uses the database or a representative sample of it to create statistically related
terms.
 It is conceptually a thesaurus in that words are related to other words, here
statistically, by their frequently occurring together in the same items.
 is very dependent upon the database being searched
 may not be portable to other databases.
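
A statistics-based thesaurus can be sketched by counting how often two words occur in the same item. The four sample items and the threshold of two co-occurrences are assumptions; real systems use more refined statistics.

```python
from collections import Counter
from itertools import combinations

# Hypothetical items from the database being searched.
items = [
    "computer network security",
    "computer network protocol",
    "garden flower soil",
    "computer security audit",
]

# Count the items in which each unordered pair of words occurs together.
cooccur = Counter()
for text in items:
    for a, b in combinations(sorted(set(text.split())), 2):
        cooccur[(a, b)] += 1

def related(term, min_count=2):
    """Words that co-occur with `term` in at least `min_count` items."""
    out = []
    for (a, b), n in cooccur.items():
        if n >= min_count and term in (a, b):
            out.append(b if a == term else a)
    return sorted(out)

print(related("computer"))  # ['network', 'security']
```

Because the relations come entirely from these items, the same table built over a different database would differ, which is the portability concern noted above.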

Natural Language Queries


 allow a user to enter a prose statement instead of using Boolean operators.
 Negation is difficult to express in natural language queries.
 Natural language interfaces improve the recall of systems with a decrease in
precision when negation is required.

1.6 Browse Capabilities

 Once the search is complete, Browse capabilities provide the user with the
capability to determine which items are of interest and select those to be
displayed.
 There are two ways of displaying a summary of the items that are associated with
a query: line item status and data visualization.
 From these summary displays, the user can select the specific items and zones
within the items for display.

 Ranking
 Zoning
 Highlighting

Ranking
 Hits are typically retrieved either in a sorted order or in time order from the newest to the oldest.
 In systems that rank, hit results are instead ordered by relevance score.
 The relevance score is the search system's estimate of how closely the item satisfies the search statement.
 The display shows each item with its relevance score and a description of the item.
 Typically relevance scores are normalized to a value between 0.0 and 1.0.
 The highest value of 1.0 is interpreted to mean that the system is sure that the item is relevant
to the search statement.
 In addition to ranking, based upon the characteristics of the item and the database,
collaborative filtering (matching people with similar interests and making recommendations)
provides an option for selecting and ordering output.
 Collaborative filtering has been very successful in sites such as AMAZON.COM,
MovieFinder.com, and CDNow.com in deciding what products to display to users
based upon their queries.

 Rather than limiting the number of items that can be assessed by the number of lines
on a screen, other graphical visualization techniques showing the relevance
relationships of the hit items can be used.
 For example, a two or three dimensional graph can be displayed where points on the
graph represent items and the location of the points represents their relative
relationship between each other and the user’s query.
 In some cases color is also used in this representation.
 It allows a user to see the clustering of items by topics and browse through a cluster
or move to another topical cluster.
Zoning
 Users want to see the minimum information needed to determine whether an item is
relevant.
 Once the determination is made, the user wants to display the complete item for detailed
review.
 Limited display screen sizes require the user to be able to select which portions of an item
are shown when judging relevance.
 The basic unit searched need not be the complete item; it can be an algorithmically defined
subdivision of the item.
 Related to zoning, for use in minimizing what an end user needs to review from a
hit item is the idea of locality and passage based search and retrieval.
 Passage retrieval
 Item is divided into uniform sized passages that are indexed.
 Locality based retrieval
 Passage boundaries can be dynamic.

Highlighting
 Highlighting is another display aid.
 Most systems allow the display of an item to begin with the first highlight within
the item and allow subsequent jumping to the next highlight.
 Provides the capability to determine passage in the document most relevant to the
query and position the browse to start at that passage.
 The DCARS system, which acts as a user front-end to the RetrievalWare search
system, allows the user to browse an item in the order of the paragraphs or
individual words that contributed most to the rank value associated with the item.
 The highlighting may vary by introducing colors and intensities to indicate the
relative importance of a particular word in the item in the decision to retrieve the
item.

1.7 Miscellaneous Capabilities

 Vocabulary Browse
 Iterative Search and Search History Log
 Canned Query

Vocabulary Browse
 Vocabulary Browse provides the capability to display in alphabetical sorted order
words from the document database.
 Logically, all unique words (processing tokens) in the database are kept in sorted
order along with a count of the number of unique items in which the word is
found.
 The user can enter a word or word fragment and the system will begin to display
the dictionary around the entered text.

 It helps the user determine the impact of using a fixed or variable length mask on a
search term and potential misspellings.
 The user can determine that entering the search term “compul*” in effect is
searching for “compulsion” or “compulsive” or “compulsory.”
 It also shows that someone probably entered the word “computen” when they
really meant “computer.”
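
Vocabulary Browse can be sketched with a sorted dictionary and a binary search for the entered fragment. The words follow the examples above, but the item counts and the window size are made-up assumptions.

```python
from bisect import bisect_left

# Hypothetical dictionary: (processing token, number of items containing it), sorted.
dictionary = [
    ("compulsion", 12), ("compulsive", 7), ("compulsory", 31),
    ("computen", 1), ("computer", 980), ("computing", 210),
]
words = [word for word, _ in dictionary]

def browse(fragment, window=2):
    """Show the dictionary entries around the place where `fragment` would sort."""
    i = bisect_left(words, fragment)
    return dictionary[max(0, i - window): i + window]

for word, count in browse("comput"):
    print(f"{word:12} {count}")
```

The single item containing “computen” next to the large count for “computer” is the kind of evidence that suggests a misspelling.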

Iterative Search and Search History Log
 Frequently a search returns a Hit file containing many more items than the user
wants to review.
 Rather than typing in a completely new query, the results of the previous search can
be used as a constraining list against which a new query is applied.
 This has the same effect as taking the original query and adding an additional search
statement to it with an AND condition.
 This process of refining the results of a previous search to focus on relevant items
is called iterative search.
 This also applies when a user uses relevance feedback to enhance a previous
search.
 The search history log is the capability to display all the previous searches that
were executed during the current session.
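
A sketch of iterative search and the search history log, using a hypothetical inverted index and assuming each query is a non-empty list of terms that are ANDed together.

```python
# Hypothetical inverted index: term -> set of item ids.
index = {
    "computer": {1, 2, 3, 4, 5},
    "network": {2, 4, 6},
    "security": {4, 5, 6},
}
history = []  # search history log for the current session

def search(terms, constrain_to=None):
    """AND the terms; optionally restrict the hits to a previous Hit file."""
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    if constrain_to is not None:
        hits &= constrain_to
    history.append((" AND ".join(terms), len(hits)))
    return hits

first = search(["computer"])
refined = search(["network"], constrain_to=first)

print(refined)   # {2, 4}
print(history)   # [('computer', 5), ('network', 2)]
```

The refined result equals what the single query computer AND network would return.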

Canned Query
 The capability to name a query and store it to be retrieved and executed during a
later user session is called canned or stored queries.
 A canned query allows a user to create and refine a search that focuses on the
user’s general area of interest one time and then retrieve it to add additional
search criteria to retrieve data that is currently needed.
 Canned query features also allow for variables to be inserted into the query and
bound to specific values at execution time.
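
Variables in a canned query can be sketched with Python's standard `string.Template`. The query name, the query syntax and the variable names are hypothetical.

```python
from string import Template

# Stored (canned) queries, kept by name for later sessions.
canned = {
    "budget_watch": Template("budget AND $agency AND year:$year"),
}

def run_canned(name, **bindings):
    """Bind the variables of a stored query at execution time."""
    return canned[name].substitute(**bindings)

print(run_canned("budget_watch", agency="NASA", year=1999))
# budget AND NASA AND year:1999
```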

1.8 Standards for IRS

Z39.50
 The Z39.50 standard does not specify an implementation, but the capabilities
within an application (Application Service) and the protocol used to communicate
between applications (Information Retrieval Application Protocol).
 It is a computer to computer communications standard for database searching and
record retrieval.
 Its objective is to overcome different system incompatibilities associated with
multiple database searching.
 The original version of Z39.50 was approved in 1988, and a revised version followed in 1992.
 An international version of Z39.50, called the Search and Retrieve Standard (SR),
was approved by the International Organization for Standardization (ISO) in 1991.
 Z39.50-1995, the latest version of Z39.50, replaces SR as the international
information retrieval standard.
 The standard describes eight operation types: Init (initialization), Search, Present,
Delete, Scan, Sort, Resource-report, and Extended Services.
 There are six types of queries (Types 0, 1, 2, 100, 101, and 102).

 The client is identified as the “Origin” and performs the communications functions
relating to initiating a search, translation of the query into a standardized format,
sending a query, and requesting return records.
 The server is identified as the “Target” and interfaces to the database at the remote
site, responding to requests from the Origin (e.g., pass query to database, return records in
a standardized format and status).
 The end user does not have to be aware of the details of the standard since the Origin
function performs the mapping from the user’s query interface into Z39.50 format.

 This makes the dissimilarities of different database systems transparent to the user and
facilitates issuing one query against multiple databases at different sites returning to the
user a single integrated Hit file.

WAIS (Wide Area Information Servers)
 WAIS (Wide Area Information Servers) is the de facto standard for many search
environments on the Internet.
 WAIS was developed by a project started in 1989 by three commercial companies
(Apple, Thinking Machines, and Dow Jones).
 The original idea was to create a program that would act as a personal librarian.
 A free version of WAIS called “FreeWAIS” is still available via the Clearinghouse for
Networked Information Discovery and Retrieval (CNIDR).
 The original development of WAIS started with the 1988 Z39.50 protocol as a base,
following the client/server architecture concept.
 The developers incorporated the information retrieval concepts that allow for
ranking, relevance feedback and natural language processing functions that apply to
full text searchable databases.
 The Corporation for National Research Initiatives (CNRI) is working with the Department
of Defense and the Association of American Publishers (AAP), focusing on an
Internet implementation that allows for control of electronically published and copyrighted
material.
 In addition to the Handle Server architecture, CNRI is also advocating a
communications protocol to retrieve items from existing systems.
 Repository Archive Protocol (RAP) defines the mechanisms for clients to use the
handles to retrieve items.
 It also includes other administrative functions such as privilege validation.
 The Handle system is designed to meet the Internet Engineering Task Force (IETF)
requirements for naming Internet objects via Uniform Resource Names to replace
URLs as defined in the Internet’s RFC-1737 (IETF-96).
 WAIS (Wide Area Information Servers) is an Internet system in which specialized
subject databases are created at multiple server locations, kept track of by a
directory of servers at one location, and made accessible for searching by users with
WAIS client programs.
 The user of WAIS is provided with or obtains a list of distributed databases.
 The user enters a search argument for a selected database and the client then accesses
all the servers on which the database is distributed.
 The results provide a description of each text that meets the search requirements.
