Search and Resource Discovery Paradigms
MODULE 4
Search and retrieval begins when a user provides a description of the information
being sought to an automated discovery system.
Using the knowledge of the environment, the system attempts to locate the
information that matches the given description.
An information retrieval method depends on the libraries.
The challenge is to develop user in domains such as electronic shopping.
Search and retrieval methods refine queries through various computing
techniques, such as nearest-neighbor matching, that generate variants of the
original query.
Information filtering:
The process of searching for text strings in a large collection of documents can be
divided into two phases: the end-user retrieval phase and the publisher indexing phase.
The end-user retrieval phase consists of three steps that the user performs during the
text search.
1. First, the user formulates a query, specifying in some way the material for which
the text database is to be searched.
2. Second, the server interprets the user’s query, performs the search, and returns to
the user a list of documents meeting the search criteria. The list of matching
documents returned to the user is generally called a hit list.
3. Third, the user selects documents from the hit list and browses them, reading and
perhaps printing selected portions of retrieved data.
To illustrate, if the user specifies a query to find all documents containing the
string “Electronic commerce”, the system would apply a string-matching
algorithm to all the documents to search for the string. The result might be the
retrieval of a very large number of documents. To reduce the number of documents
retrieved, some systems allow users to specify the number of documents that they
would like to see in any one search, typically based on the location of the data,
with limited per-item or per-location searching facilities. In short, the goal for
the user is to obtain a limited set of information from an on-line source to meet
some need or solve some problem.
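The retrieval steps above can be sketched in a few lines of Python. This is only
an illustration under simple assumptions: the documents are held in memory, the
match is a case-insensitive substring test, and the names build_hit_list and
max_hits are hypothetical, not part of any particular system.

```python
def build_hit_list(documents, query, max_hits=None):
    """Return the names of documents that contain the query string.

    documents maps a document name to its text. max_hits is an optional
    cap on how many matching documents are returned in one search.
    """
    needle = query.lower()
    hits = []
    for name, text in documents.items():
        if needle in text.lower():
            hits.append(name)
            if max_hits is not None and len(hits) >= max_hits:
                break
    return hits


# Hypothetical sample collection used only for this illustration.
documents = {
    "doc1.txt": "Electronic commerce is changing retail.",
    "doc2.txt": "Notes on database indexing.",
    "doc3.txt": "A survey of electronic commerce payment systems.",
}

print(build_hit_list(documents, "Electronic commerce"))
print(build_hit_list(documents, "Electronic commerce", max_hits=1))
```

Scanning every document on every query is the slow path. The publisher
indexing phase exists so that the server does not have to do this.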
The publisher indexing phase consists of entering documents into the system and
creating indexes and pointers to facilitate subsequent searches.
The process of loading documents into the system and updating indexes is normally
not a concern to the user.
These two phases are highly interdependent.
Searching can be comprehensive throughout the archive (for example, WAIS servers
provide full-text indexes) or limited to certain keywords.
WAIS (Wide Area Information Servers) enables users to search the contents of files
for any string of text that they supply.
It provides an English-language query front end to a large assortment of databases
that contain text-based documents.
It allows users to search the full text of all the documents on the server.
Users on different platforms can access personal, company, and published information
from one interface, whether that information is text, pictures, voice, or formatted
documents.
Anyone can use this system because it uses natural language questions to find relevant
documents.
Relevant documents can be fed back to a server to refine the search.
The servers take a user’s question and do their best to find relevant documents.
The WAIS server returns a list of documents that contain the specified phrases and
keywords.
Today, the Netscape or NCSA Mosaic browser with the forms capability is often used
as a front end to talk to a WAIS server.
WAIS has three elements: a client, a server and an indexer.
First, the indexer takes a list of files the publisher wants to index and generates from it
several index files.
These indexes include a directory of all words appearing in the database and a list
of the documents and files that constitute the database.
With the index created, the publisher must tell the rest of the world about it. The
publisher does this by running WAIS with a register option, which places this
index next to the hundreds of WAIS indexes already available on the Internet.
WAIS solves a number of problems from the user’s perspective.
1. It allows users to identify and select information from large databases.
2. It provides heterogeneous database access, as published databases may be on a
variety of different systems and the user need not know how to use each system.
3. It provides ways to download and organize the retrieved data so that users are not
overwhelmed.
Search Engines
The purpose of a search engine in any indexing system is simple: to find every item
that matches a query, no matter where it is located in the file system.
Search engines are now being designed to go beyond the simple, broad searches for
which WAIS is so popular.
One of the more popular approaches is used by Topic, a search engine used in Lotus
Notes, Adobe Acrobat and a variety of other products.
It uses both keywords and information searching to rank the relevance of each
document.
A different approach is offered by context-based searching. These tools let the user
enter a query and then come up with the relevant data based on the context of the
documents themselves.
Other approaches to data searching on the Web or on other wide-area networks are
available.
The most compelling is Oracle’s Context, which can go through a variety of
documents and create its own summary, pulling about three key sentences from each
document it selects.
Indexing methods:
To accomplish accuracy and conserve disk space, two types of indexing methods are
used by search engines. They are:
1. File-level indexing
2. Word-level indexing
1. File-level indexing:
It associates each indexed word with a list of all files in which that word appears
at least once.
It does not carry any information about the location of words within the file.
2. Word-level indexing:
It is more sophisticated and stores the location of every instance of a word.
These indexes enable users to search for complete phrases or words that are in close
proximity.
The disadvantage of word-level indexing is that all the extra information it
contains takes up a lot of disk space, anywhere from 35 to 100 percent of the
size of the original text.
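The difference between the two indexing methods can be shown with a small
sketch. It assumes an in-memory collection and a simple tokenizer that splits on
non-alphanumeric characters. All function names here are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_file_level_index(files):
    """Map each word to the set of files in which it appears at least once."""
    index = defaultdict(set)
    for name, text in files.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def build_word_level_index(files):
    """Map each word to {file name: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for name, text in files.items():
        for position, word in enumerate(tokenize(text)):
            index[word][name].append(position)
    return index


def phrase_search(word_index, phrase):
    """Return files in which the words of the phrase occur consecutively."""
    words = tokenize(phrase)
    if not words or any(word not in word_index for word in words):
        return []
    hits = []
    for name, starts in word_index[words[0]].items():
        for start in starts:
            if all(start + offset in word_index[word].get(name, [])
                   for offset, word in enumerate(words)):
                hits.append(name)
                break
    return sorted(hits)


# Hypothetical sample files used only for this illustration.
sample_files = {
    "a.txt": "electronic commerce and electronic mail",
    "b.txt": "commerce for electronic goods",
}
file_index = build_file_level_index(sample_files)
word_index = build_word_level_index(sample_files)

print(sorted(file_index["electronic"]))
print(phrase_search(word_index, "electronic commerce"))
```

Both sample files contain both words, so the file-level index cannot tell them
apart for the phrase "electronic commerce". The word-level index can, at the
cost of storing a position for every word occurrence.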
The process of indexing data is a simple one, and a large number of indexing
packages are available.
These indexing packages fall into three categories:
1. The client-server method: It is based on the distributed approach in which the
document database and the text search and retrieval software reside on a central
server and the data representation and user-interface software reside on the
user’s workstation. In this approach, the index file can be split into pieces
corresponding to work groups and maintained on separate servers.
2. The mainframe-based approach: It is generally more expensive and less flexible
than the client-server architecture, but it provides for large amounts of storage,
fast response time, and standard data management and configuration control. The
mainframe may also handle query and display formatting, enabling searches to
be conducted from non-intelligent character-based terminals.
3. The parallel-processing approach: It allows many processing units to conduct
searches simultaneously. The file to be searched is broken up into many pieces,
and each processor searches its segment of the index file. The processors may or
may not share memory and storage. The results are merged before being
presented to the user.
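The parallel-processing approach can be sketched as follows. This is a minimal
illustration, assuming the index has already been split into segments, each a
dictionary from word to file names. Threads stand in for the processing units.
In CPython they show the split, search, and merge structure rather than real
CPU parallelism. The names search_segment and parallel_search are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_segment(segment, word):
    """Search one segment of the index and return its matching files."""
    return segment.get(word, [])


def parallel_search(segments, word):
    """Search all index segments concurrently and merge the results."""
    workers = max(1, len(segments))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(
            pool.map(search_segment, segments, [word] * len(segments))
        )
    merged = set()
    for result in partial_results:
        merged.update(result)
    return sorted(merged)


# Hypothetical index split into three segments.
segments = [
    {"commerce": ["a.txt"]},
    {"commerce": ["c.txt"], "mail": ["d.txt"]},
    {"mail": ["b.txt"]},
]

print(parallel_search(segments, "commerce"))
```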
Over the past few years, new technologies have become incorporated into systems
that provide additional possibilities for, but also challenges to, effective search.
We have the following search technologies for effective search:
Hypertext: Richly interwoven links among items in displays allow users to
move in relatively ad hoc sequences from display to display within multimedia
database applications.
Sound: Speech input and output, music, and a wide variety of acoustic cues,
including realistic sounds, supplement and replace visual communication.
Video: Analog or digital video input from multiple media, including video
tapes, CD-ROM, broadcast video tuners, cable, and satellite, provides video
imagery that supplements and replaces computer-generated graphics.
3D-images: Virtual reality displays offer a 3D environment in which all
portions of the user interface are 3D.
Robots, Wanderers and Spiders are all programs that traverse the WWW
automatically gathering information.
For E-commerce, agent-based resource discovery is becoming increasingly important
as the number of sellers increases.
A resource discovery program might fill out a form, or supply a user name and
password, to access the data of interest.
A software agent views the World Wide Web as a graph.
It starts at a set of nodes (HTML pages) and traverses the hypertext links in these
nodes to a certain depth, beginning at a URL passed as an argument.
Only URLs having “.” suffixes or tagged as “HTTP:” and ending in a slash are
probed.
This method results in a limited-depth breadth-first traversal of only HTML portions
of the web.
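A limited-depth breadth-first traversal of this kind can be sketched as below.
To keep the example self-contained, the web is modelled as an in-memory
dictionary from a URL to the links on that page, so no network access takes
place. The function names, the sample URLs, and the use of ".html" as the
suffix to probe are all assumptions made for illustration.

```python
from collections import deque


def is_probed(url):
    """Decide whether a link should be followed.

    Assumption: a link is probed if it ends in ".html", or if it is an
    "http:" URL ending in a slash.
    """
    return url.endswith(".html") or (url.startswith("http:") and url.endswith("/"))


def crawl(link_graph, start_url, max_depth):
    """Return the URLs reached from start_url within max_depth link hops."""
    visited = [start_url]
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in link_graph.get(url, []):
            if link not in seen and is_probed(link):
                seen.add(link)
                visited.append(link)
                queue.append((link, depth + 1))
    return visited


# Hypothetical link graph: each page maps to the links it contains.
link_graph = {
    "http://example.com/": [
        "http://example.com/a.html",
        "http://example.com/logo.gif",
        "http://example.com/shop/",
    ],
    "http://example.com/a.html": ["http://example.com/b.html"],
    "http://example.com/shop/": [
        "http://example.com/a.html",
        "http://example.com/shop/item.html",
    ],
    "http://example.com/b.html": ["http://example.com/c.html"],
}

print(crawl(link_graph, "http://example.com/", 2))
```

The queue makes the traversal breadth-first: every page at depth 1 is visited
before any page at depth 2, and the image link is never probed.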
However, because of time constraints and the heterogeneity of the information and of
the repositories, multiple software agents are required to perform exhaustive searches.
2. Yellow pages
White pages are used to find people or institutions, and yellow pages are used to
find consumers and organizations.
Analogous to the telephone white pages, the electronic white pages provide services
ranging from a static listing of e-mail addresses to directory assistance.
White pages directories, also found within organizations, are integral to work
efficiency.
The problems facing organizations are similar to the problems facing individuals.
A white pages schema is a data model, specifically a logical schema, for organizing
the data contained in entries in a directory service, database, or application, such as an
address book.
A white pages schema typically defines, for each real-world object being represented:
What attributes of that object are to be represented in the entry for that object?
What relationships of that object to other objects are to be represented?
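As a rough illustration, the two questions a white pages schema answers can be
expressed as a pair of Python data classes. The attribute and relationship names
below are hypothetical examples chosen for this sketch. They are not taken from
any standard schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OrganizationalUnit:
    """A division or group that a person can belong to."""
    name: str


@dataclass
class PersonEntry:
    """One directory entry representing an individual person."""
    # Attributes of the object.
    common_name: str
    surname: str
    mail: str
    telephone_number: Optional[str] = None
    # Relationships of the object to other objects.
    member_of: List[OrganizationalUnit] = field(default_factory=list)
    manager: Optional["PersonEntry"] = None


sales = OrganizationalUnit(name="Sales")
boss = PersonEntry(common_name="Carol Clark", surname="Clark",
                   mail="carol@example.com", member_of=[sales])
entry = PersonEntry(common_name="Alice Adams", surname="Adams",
                    mail="alice@example.com", member_of=[sales], manager=boss)

print(entry.common_name, "reports to", entry.manager.common_name)
```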
One of the earliest attempts to standardize a white pages schema for electronic mail
use was in X.520 and X.521, part of the X.500 specifications, which were derived from
the addressing requirements of X.400.
In a white pages directory, each entry typically represents an individual person who
makes use of network resources, such as by receiving email or having an account
to log into a system.
In some environments, the schema may also include the representation of
organizational divisions, roles, groups, and devices.
The term is derived from the white pages, the listing of individuals in a telephone
directory, typically sorted by the individual's home location (e.g. city) and then by
their name.
One of the first goals of the X.500 project was to create a directory for keeping
track of individual electronic mail addresses on the Internet.
X.500 offers the following features:
Decentralized maintenance: Each site running X.500 is responsible only for its local
part of the directory.
Searching capabilities: X.500 provides powerful searching capabilities. For example,
in the white pages you can search solely for users in one country. From there you can
view a list of organizations, then departments, then individual names. This
represents the tree structure of the directory.
Single global name space: X.500 provides a single name space to users.
Structured information framework: X.500 defines the information framework used in
the directory, allowing local extensions.
Standards-based directory: X.500 can be used to build directory applications that
require distributed information.
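The country, organization, department, and name drill-down described under
searching capabilities amounts to walking a tree. The sketch below models such a
tree with nested dictionaries. It is not an implementation of the X.500
protocols, and the directory contents and function names are hypothetical.

```python
def find_node(tree, path):
    """Follow a path of names down from the root of the directory tree."""
    node = tree
    for part in path:
        node = node[part]
    return node


def list_children(tree, path=()):
    """List the entries directly below the node reached by path."""
    return sorted(find_node(tree, path))


def people_under(tree, path=()):
    """Collect every individual name anywhere below the node reached by path."""
    def collect(node):
        if isinstance(node, list):  # a department holds a list of names
            return list(node)
        names = []
        for child in node.values():
            names.extend(collect(child))
        return names
    return sorted(collect(find_node(tree, path)))


# Hypothetical directory: country -> organization -> department -> names.
directory = {
    "US": {
        "Example Corp": {
            "Sales": ["Alice Adams", "Bob Brown"],
            "Engineering": ["Carol Clark"],
        },
    },
    "GB": {
        "Sample Ltd": {"Support": ["Dan Davies"]},
    },
}

print(list_children(directory))
print(list_children(directory, ("US", "Example Corp")))
print(people_under(directory, ("US",)))
```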
INFORMATION FILTERING
Email filtering:
Mail-filtering agents:
Users of mail-filtering agents can instruct them to watch for items of interest in
e-mail in-boxes, on-line news services, electronic discussion forums, and the like.
The mail agent will pull the relevant information and put it in the user's
personalized newspaper at predetermined intervals.
An example is Apple's AppleSearch software. Mail filters can be installed by the
user, either as separate programs or as part of their e-mail program
(e-mail client).
In e-mail programs, users can make personal, "manual" filters that then
automatically filter mail according to the chosen criteria.
Most e-mail programs now also have an automatic spam filtering function.
Internet service providers can also install mail filters in their mail transfer agents
as a service to all of their customers. Corporations often use them to protect their
employees and their information technology assets.
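A user-defined "manual" filter of the kind described above can be pictured as an
ordered list of rules, each naming a message header, a text to look for, and a
destination folder. The sketch below assumes messages are plain dictionaries.
The rule format and all names are hypothetical and do not reflect any particular
e-mail client.

```python
from dataclasses import dataclass


@dataclass
class FilterRule:
    header: str    # message header to inspect, for example "from" or "subject"
    contains: str  # text that must appear in that header
    folder: str    # folder that receives matching messages


def file_message(message, rules, default_folder="Inbox"):
    """Return the folder of the first rule that matches the message."""
    for rule in rules:
        value = message.get(rule.header, "")
        if rule.contains.lower() in value.lower():
            return rule.folder
    return default_folder


# Hypothetical rules chosen by a user.
rules = [
    FilterRule(header="subject", contains="invoice", folder="Billing"),
    FilterRule(header="from", contains="@news.example.com", folder="Newsletters"),
]

message = {"from": "digest@news.example.com", "subject": "Weekly digest"}
print(file_message(message, rules))
```

Rules are checked in order and the first match wins, so more specific rules
belong earlier in the list.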
News-filtering agents:
******************