INFORMATION RETRIEVAL SYSTEMS (IRS)
Course Instructor
P.Veera Swamy
Assistant Professor
UNIT-5 SYLLABUS
Text Search Algorithms:
⚫ Introduction
⚫ Software text search algorithms
⚫ Hardware text search systems
Information System Evaluation:
⚫ Introduction
⚫ Measures used in system evaluation
⚫ Measurement example – TREC results.
OVERVIEW
Three classical techniques have been defined for
organizing items in a textual database, for rapidly
identifying the relevant items, and for eliminating items
that do not satisfy a search:
1. Full text scanning (streaming) - see the sketch below
2. Word inversion
3. Multi-attribute retrieval
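For illustration, a minimal sketch of the first technique, full text scanning, in Python; the function name and sample data are hypothetical, not taken from any particular system:

```python
# Hypothetical sketch of full text scanning: every item's raw text is
# scanned for the query term at search time; no index is built.

def full_text_scan(items, term):
    """Return the ids of items whose text contains `term` (case-insensitive)."""
    hits = []
    for item_id, text in items.items():
        if term.lower() in text.lower():  # brute-force streaming scan
            hits.append(item_id)
    return hits

database = {
    1: "Text search algorithms scan every item in the database.",
    2: "Word inversion builds an index instead of scanning.",
}
print(full_text_scan(database, "scan"))  # -> [1, 2]
```

Word inversion, by contrast, trades this per-query scanning cost for an index built once, when items arrive.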
[Figure: the user's information need drives a search; the results are returned to the user and evaluated.]
INTRODUCTION TO INFORMATION SYSTEM EVALUATION
Early evaluations focused primarily on the effectiveness
of search algorithms. The creation of the annual Text
REtrieval Conference (TREC), sponsored by the Defense
Advanced Research Projects Agency (DARPA) and the
National Institute of Standards and Technology (NIST),
changed the standard process of evaluating information
systems.
The conference has been held every year since 1992,
usually in the fall. It provides academic researchers and
commercial companies with a standard database
consisting of gigabytes of test data, search statements,
and the expected results of the searches, for testing
their systems.
In recent years the evaluation of Information Retrieval
Systems and of techniques for indexing, sorting,
searching and retrieving information has become
increasingly important.
This growth in interest is due to two major reasons:
the growing number of retrieval systems in use and an
additional focus on the evaluation methods themselves.
There are many reasons to evaluate the
effectiveness of an Information Retrieval System
(Belkin-93, Callan-93):
⚫ To aid in the selection of a system to procure
⚫ To monitor and evaluate system effectiveness
⚫ To evaluate the query generation process for improvements
⚫ To provide inputs to cost-benefit analysis
of an information system
⚫ To determine the effects of changes made
to an existing information System
EVALUATION CRITERIA
Effectiveness
⚫ System-only, human+system
Efficiency
⚫ Retrieval time, indexing time, index size
Usability
⚫ Learnability, novice use, expert use
From an academic perspective, measurements are
focused on the specific effectiveness of a system and
usually are applied to determining the effects of
changing a system's algorithms or comparing
algorithms among systems.
From a commercial perspective, measurements are
also focused on availability and reliability.
The most important evaluation metrics of
information systems will always be biased by human
subjectivity.
This problem arises from the specific data collected
to measure the resources a user expends in locating
relevant information.
A factor in most metrics for determining how well a
system is working is the relevancy of items.
Relevancy of an item, however, is not a binary
judgment but a continuous function between an
item's being exactly what is being looked for and its
being totally unrelated.
To discuss relevancy, it is necessary to define the context
under which the concept is used. From a human judgment
standpoint, relevancy can be considered:
Subjective - depends upon a specific user's judgment
Situational - relates to a user's requirements
Cognitive - depends on human perception and
behavior
Temporal - changes over time
Measurable - observable at a point in time
In a dynamic environment, each user has his own
understanding of the requirement and the threshold on
what is acceptable. Based upon his cognitive model of
the information space and the problem, the user judges
a particular item. Some users consider information they
already know to be non-relevant to their information
need.
⚫ Example: An article that the user wrote does not provide
"new" relevant information to answer the user's query,
although the article may be very relevant to the search
statement. Also, the judgment of relevance can
vary over time. Retrieving information on an "XT" class of
PCs is not of significant relevance to personal computers in
1996, but would have been valuable in 1992. Thus, relevance
judgment is measurable at a point in time constrained by the
particular users and their thresholds on acceptability of
information.
Another way of specifying relevance is from
information, system and situational views.
1. Information View
The information view is subjective in nature and
pertains to human judgment of the conceptual
relatedness between an item and the search.
It involves the user's personal judgment of the
relevancy (aboutness) of the item to the user's
information need.
When reference experts (librarians, researchers,
subject specialists, indexers) assist the user, it is
assumed they can reasonably predict whether certain
information will satisfy the user's needs.
Ingwersen categorizes the information view into four
types of "aboutness" (Ingwersen-92):
1. Author Aboutness - determined by the author's language
as matched by the system in natural language retrieval
2. Indexer Aboutness - determined by the indexer's
transformation of the author's natural language into a
controlled vocabulary
3. Request Aboutness - determined by the user's or
intermediary's processing of a search statement into a
query
4. User Aboutness - determined by the indexer's attempt to
represent the document according to presuppositions
about what the user will want to know
2. System View
The system view relates to a match between query
terms and terms within an item. It can be objectively
observed, manipulated and tested without relying on
human judgment because it uses metrics associated
with the matching of the query to the item (Barry-94,
Schamber-90).
The semantic relatedness between queries and items
is assumed to be inherent in the index terms, which
represent the semantic content of the item in a
consistent and accurate fashion.
3. The Situation View
The situation view pertains to the relationship
between information and the user's information
problem situation. It assumes that only users can
make valid judgments regarding the suitability of
information to solve their information need.
Lancaster and Warner refer to information and
situation views as relevance and pertinence
respectively (Lancaster-93). Pertinence can be
defined as those items that satisfy the user's
information need at the time of retrieval.
MEASURES USED IN SYSTEM EVALUATIONS
To define the measures that can be used in
evaluating Information Retrieval Systems, it is useful
to define the major functions associated with
identifying relevant items in an information system.
Items arrive in the system and are automatically or
manually transformed by "indexing" into searchable
data structures.
The user determines what his information need is
and creates a search statement. The system processes
the search statement, returning potential hits. The
user selects those hits to review and accesses them.
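The flow just described can be illustrated with a minimal Python sketch: items are "indexed" into a searchable data structure (here an inverted index), a search statement is processed, and potential hits are returned. All names and sample data are illustrative assumptions, not from a specific system.

```python
# Sketch of the pipeline: index items on arrival, then match a
# search statement against the index to produce potential hits.

from collections import defaultdict

def index_items(items):
    inverted = defaultdict(set)          # term -> set of item ids
    for item_id, text in items.items():
        for term in text.lower().split():
            inverted[term].add(item_id)
    return inverted

def search(inverted, statement):
    terms = statement.lower().split()
    # potential hits: items containing every term of the search statement
    sets = [inverted.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

items = {
    1: "evaluation of information retrieval systems",
    2: "text search algorithms and hardware systems",
}
idx = index_items(items)
print(search(idx, "retrieval systems"))  # -> {1}
```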
Identifying Relevant Items
Measurements can be made from two
perspectives:
⚫ User Perspective and
⚫ System Perspective
1. User Perspective
The user takes a personal view: an item's relevancy is
judged against the user's own information need, as
reflected in the four types of "aboutness" described
earlier.
2. System Perspective
The system perspective is based upon aggregate
functions, whereas the user takes a more personal view.
If a user's PC is not connecting to the system, then,
from that user's view, the system is not operational.
Techniques for collecting measurements can also
be objective or subjective.
⚫ An objective measure is one that is well-defined and
based upon numeric values derived from the system
operation.
⚫ A subjective measure can produce a number, but is based
upon an individual user's judgments.
MEASURES ASSOCIATED WITH SYSTEM EVALUATIONS
1. Search Process
2. Response Time
3. Consistency
4. Quality of the Search
5. Fallout
6. Unique Relevance Recall (URR)
7. Novelty Ratio
8. Coverage Ratio
9. Sought Recall
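Fallout and the recall-based ratios above are typically defined as ratios over sets of retrieved and relevant items. A minimal sketch, assuming the standard set-based definitions of precision, recall, and fallout; the sample sets are hypothetical:

```python
# Hedged sketch of the standard set-based formulas.
# n_total is the number of items in the collection.

def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)

def fallout(retrieved, relevant, n_total):
    non_relevant = n_total - len(relevant)   # non-relevant items in collection
    return len(retrieved - relevant) / non_relevant

retrieved = {1, 2, 3, 4}        # ids returned by the search
relevant  = {2, 4, 5}           # ids judged relevant
print(precision(retrieved, relevant))    # 2/4 = 0.5
print(recall(retrieved, relevant))       # 2/3 ~ 0.67
print(fallout(retrieved, relevant, 10))  # 2/7 ~ 0.29
```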
SEARCH PROCESS
This is associated with a user creating a new search
or modifying an existing query. In creating a search,
an example of an objective measure is the time
required to create the query, measured from when
the user enters into a function allowing query input
to when the query is complete.
Completeness is defined as when the query is
executed. Although of value, the possibilities for
erroneous data (except in controlled environments)
are so great that data of this nature are not collected
in this area in operational systems.
Example: The erroneous data comes from the user
performing other activities in the middle of creating
the search such as going to get a cup of coffee.
RESPONSE TIME
Response time is a metric frequently collected to
determine the efficiency of the search execution. Response
time is defined as the time it takes to execute the search.
The ambiguity in response time originates from the
possible definitions of the end time.
The beginning is always correlated to when the user tells
the system to begin searching. The end time is affected by
the difference between the user's view and a system view.
From a user's perspective, a search could be considered
complete when the first result is available for the user to
review, especially if the system has new items available
whenever the user needs to see the next item. From a
system perspective, system resources are being used until
the search has determined all hits.
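A small Python sketch, assuming a hypothetical streaming search that yields hits one at a time, makes the difference between the two end times concrete:

```python
# User view: response time ends at the first available hit.
# System view: resources are used until all hits are determined.

import time

def streaming_search(items, term):
    for item_id, text in items.items():
        time.sleep(0.01)                 # stand-in for per-item work
        if term in text:
            yield item_id

items = {i: f"document {i} about retrieval" for i in range(100)}

start = time.perf_counter()
results = streaming_search(items, "retrieval")
first = next(results)                    # user view: first hit available
t_first = time.perf_counter() - start
remaining = list(results)                # system view: all hits determined
t_all = time.perf_counter() - start

print(f"first result after {t_first:.2f}s, all hits after {t_all:.2f}s")
```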
CONSISTENCY
To ensure consistency, response time is usually associated
with the completion of the search. This is one of the most
important measurements in a production system.
Determining how well a system is working answers the
typical concern of a user: "the system is working slow
today."
It is difficult to define objective measures on the process of
a user selecting hits for review and reviewing them. The
problems associated with search creation apply to this
operation.
Using time as a metric does not account for reading and
cognitive skills of the user along with the user performing
other activities during the review process.
Data are usually gathered on the search creation and
hit-file review processes by subjective techniques, such
as questionnaires, to evaluate system effectiveness.
QUALITY OF THE SEARCH