
UNIT 3

SIGNATURE FILES
INTRODUCTION

Text retrieval methods have attracted much interest recently. There are numerous applications
involving the storage and retrieval of textual data:

 Electronic office filing.
 Computerized libraries.
 The National Library of Medicine.
 Electronic storage and retrieval of articles from newspapers and magazines.
 Electronic encyclopedias.
 Indexing of software components to enhance reusability.
 Searching databases with descriptions of DNA molecules.

The main operational characteristics of all the above applications are the following two:

1. Text databases are traditionally large.

2. Text databases have an archival nature: there are insertions in them, but almost never deletions
or updates.

A brief, qualitative comparison of the signature-based methods versus their competitors is as
follows: The signature-based methods are much faster than full text scanning. Compared to
inversion, they require a modest space overhead; moreover, they can handle insertions more
easily than inversion, because they need "append-only" operations -- no reorganization or
rewriting of any portion of the signatures. On the other hand, signature files may be slow for
large databases, precisely because their response time is linear in the number of items N in the
database. Thus, signature files have been used in the following environments:

1. PC-based, medium-size databases

2. Write-Once-Read-Many (WORM) devices

3. Parallel machines

4. Distributed text databases

BASIC CONCEPTS

Signature files typically use superimposed coding to create the signature of a document. A brief
description of the method follows.

For performance reasons, which will be explained later, each document is divided into "logical
blocks," that is, pieces of text that contain a constant number D of distinct, noncommon words.
(To reduce the space overhead, a stoplist of common words is maintained.) Each such word
yields a "word signature," which is a bit pattern of size F, with m bits set to "1" while the rest
are "0". F and m are design parameters. The word signatures are OR'ed together to form the
block signature. Block signatures are concatenated to form the document signature. The m bit
positions to be set to "1" by each word are decided by hash functions. Searching for a word is
handled by creating the signature of the word and examining each block signature for "1"s in
those bit positions where the signature of the search word has a "1".

Word              Signature
---------------------------------
free              001 000 110 010
text              000 010 101 001
---------------------------------
block signature   001 010 111 011

Illustration of the superimposed coding method. It is assumed that each logical block
consists of D = 2 words only. The signature size F is 12 bits, with m = 4 bits per word.
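The scheme above can be sketched in a few lines of Python. This is a minimal sketch: the hash function (MD5 on the word plus a seed) is an illustrative assumption, so the bit patterns it produces will differ from those in the figure.

```python
import hashlib

F = 12   # signature size in bits (design parameter)
m = 4    # bits set to "1" per word (design parameter)

def word_signature(word: str) -> int:
    """Derive m bit positions from the word via a hash; positions may collide."""
    sig = 0
    for i in range(m):
        h = hashlib.md5(f"{word}:{i}".encode()).digest()
        sig |= 1 << (int.from_bytes(h[:4], "big") % F)
    return sig

def block_signature(words) -> int:
    """OR the word signatures together to form the block signature."""
    sig = 0
    for w in words:
        sig |= word_signature(w)
    return sig

def may_contain(block_sig: int, word: str) -> bool:
    """Signature test: every '1' bit of the query word must be set in the block.
    A positive answer may still be a false drop; a negative one is certain."""
    ws = word_signature(word)
    return block_sig & ws == ws
```

Note that `may_contain` can return a false positive (the false drop discussed below) but never a false negative, since ORing a word's signature into the block preserves all of its "1" bits.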

In order to allow searching for parts of words, the following method has been suggested: Each
word is divided into successive, overlapping triplets (e.g., "fr", "fre", "ree", "ee" for the word
"free"). Each such triplet is hashed to a bit position by applying a hashing function to a
numerical encoding of the triplet, for example, considering the triplet as a base-26 number. In the
case of a word that has l triplets, with l > m, the word is allowed to set l (not necessarily distinct)
bits. If l < m, the additional bits are set using a random number generator, initialized with a
numerical encoding of the word.
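A minimal sketch of the triplet extraction and hashing. It assumes the word boundaries are padded with a blank (which is how "free" yields four triplets), so the numerical encoding is effectively base-27 (26 letters plus the blank) rather than strictly base-26:

```python
def triplets(word: str):
    """Successive overlapping triplets, padding the boundaries with a blank
    (so 'free' yields ' fr', 'fre', 'ree', 'ee ')."""
    padded = " " + word + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def triplet_position(tri: str, F: int) -> int:
    """Hash a triplet to a bit position by treating it as a base-27 number
    (blank = 0, 'a' = 1, ..., 'z' = 26), reduced modulo the signature size F."""
    code = 0
    for ch in tri:
        code = code * 27 + (0 if ch == " " else ord(ch) - ord("a") + 1)
    return code % F
```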

An important concept in signature files is the false drop probability Fd. Intuitively, it gives the
probability that the signature test will fail, creating a "false alarm" (also called a "false hit" or
"false drop"). Notice that the signature test never gives a false dismissal.

False drop probability

False drop probability, Fd, is the probability that a block signature seems to qualify, given that
the block does not actually qualify. Expressed mathematically:

Fd = Prob{signature qualifies | block does not qualify}

The signature file is an F x N binary matrix. Previous analysis showed that, for a given value of F,
the value of m that minimizes the false drop probability is such that each row of the matrix
contains "1"s with probability 50 percent. Under such an optimal design, we have

Fd = 2^(-m)

F ln2 = mD

This is the reason that documents have to be divided into logical blocks: without logical blocks,
a long document would have a signature full of "1"s, and it would always create a false drop.
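The two formulas let us derive the design parameters m and F from the block size D and a target false drop probability. A small sketch (the rounding of m and F up to integers is an assumption, made so that the target is met rather than just approximated):

```python
import math

def design_parameters(D: int, max_false_drop: float):
    """Pick (m, F) under the optimal design Fd = 2^(-m), F ln2 = mD,
    for blocks of D distinct words and a target false drop probability."""
    m = math.ceil(-math.log2(max_false_drop))  # smallest m with 2^(-m) <= target
    F = math.ceil(m * D / math.log(2))         # signature size in bits
    return m, F
```

For example, with D = 40 words per block and a target Fd of at most 2^(-10), this yields m = 10 bits per word and a signature of F = ceil(400 / ln 2) bits.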

Sequential Signature File (SSF)

Although SSF has been used as is, it may be slow for large databases. Many methods have been
suggested, trying to improve the response time of SSF, trading off space or insertion simplicity
for speed. The main ideas behind all these methods are the following:

1. Compression: If the signature matrix is deliberately sparse, it can be compressed.

2. Vertical partitioning: Storing the signature matrix column-wise improves the response time at
the expense of insertion time.

3. Horizontal partitioning: Grouping similar signatures together and/or providing an index on the
signature matrix may result in better-than-linear search.

File Structure for SSF

COMPRESSION

 In this method we create sparse document signatures on purpose, and then compress them
before storing them sequentially.
 The idea is to use a (large) bit vector of B bits and to hash each word into one (or
perhaps more, say n) bit position(s), which are set to "1".
 The resulting bit vector will be sparse and can therefore be compressed.

Illustration of the compression-based methods. With B = 20 and n = 1 bit per word, the
resulting bit vector is sparse and can be compressed.

Compression using run-length encoding. The notation [x] stands for the encoded value of
number x
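The gap-counting step of run-length compression can be sketched as follows; the subsequent variable-length encoding of each run length [x] is omitted here:

```python
def run_lengths(bitvec):
    """Return the runs of '0's preceding each '1' in a sparse bit vector.
    It is these run lengths [x] that a variable-length code then encodes."""
    runs, gap = [], 0
    for bit in bitvec:
        if bit:
            runs.append(gap)  # close the current run of zeros
            gap = 0
        else:
            gap += 1
    return runs
```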

Bit-block Compression (BC)


This method accelerates the search by sacrificing some space, compared to the run-length
encoding technique. The compression method is based on bit-blocks, and was called BC (for bit-
Block Compression). To speed up the searching, the sparse vector is divided into groups of
consecutive bits (bit-blocks); each bit-block is encoded individually.

For each bit-block we create a signature, which is of variable length and consists of at most three
parts:

Part I: It is one bit long and indicates whether there are any "1"s in the bit-block
(1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.

Part II: It indicates the number s of "1"s in the bit-block. It consists of s - 1 "1"s and a
terminating zero. This is not the optimal way to record the number of "1"s; however, this
representation is simple and seems to give results close to the optimal.

Part III: It contains the offsets of the "1"s from the beginning of the bit-block (lg b bits for each
"1", where b is the bit-block size).


Illustration of the BC method with bit-block size b = 4.
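A sketch of the BC encoder following the three-part layout above; representing the output as a string of "0"/"1" characters (rather than packed bits) is just for illustration:

```python
import math

def bc_encode(bitvec, b=4):
    """BC compression sketch: split the sparse vector into bit-blocks of size b;
    per block emit Part I (one-bit empty flag), Part II (s-1 '1's plus a
    terminating '0'), Part III (lg b bits of offset for each '1')."""
    off_bits = math.ceil(math.log2(b))
    out = []
    for start in range(0, len(bitvec), b):
        block = bitvec[start:start + b]
        ones = [i for i, bit in enumerate(block) if bit]
        if not ones:
            out.append("0")                      # Part I: empty bit-block
            continue
        out.append("1")                          # Part I: non-empty
        out.append("1" * (len(ones) - 1) + "0")  # Part II: unary count of '1's
        for off in ones:                         # Part III: fixed-size offsets
            out.append(format(off, f"0{off_bits}b"))
    return "".join(out)
```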

Variable Bit-block Compression


The BC method was slightly modified to become insensitive to changes in the number of words
D per block. There is no need to “remember” whether some of the terms of the query have
appeared in one of the previous logical blocks of the message under inspection. The idea is to use
a different value for the bit-block size bopt for each message, according to the number W of bits
set to 1 in the sparse vector. The size of the sparse vector B is the same for all messages.

The figure below illustrates an example layout of the signatures in the VBC method. The upper row
corresponds to a small message with small W, while the lower row corresponds to a message with
large W. Thus, the upper row has a larger value of bopt, fewer bit-blocks, a shorter Part I (the size
of Part I is the number of bit-blocks), a shorter Part II (its size is W), and fewer but larger offsets
in Part III (the size of each offset is lg bopt bits).

An example layout of the message signatures in the VBC method


Performance
With respect to space overhead, the two methods (BC and VBC) require less space than SSF for
the same false drop probability. Their response time is slightly less than that of SSF, due to the
decreased I/O requirements. The required main-memory operations are more complicated
(decompression, etc.), but they are probably not the bottleneck. VBC achieves significant savings
even on main-memory operations. With respect to insertions, the two methods are almost as
easy as SSF; they require a few additional CPU cycles to do the compression.

Comparison of Fd of BC (dotted line) against SSF (solid line), as a function of the space
overhead Ov. Analytical results, from Faloutsos and Christodoulakis (1987)

VERTICAL PARTITIONING

The idea behind the vertical partitioning is to avoid bringing useless portions of the document
signature in main memory; this can be achieved by storing the signature file in a bit-sliced form
or in “frame-sliced” form.

Bit-Sliced Signature Files (BSSF)

The bit-sliced design is illustrated in Fig below.


Transposed bit matrix

To allow insertions, we propose using F different files, one per bit position, which will be
referred to as "bit-files." The method will be called BSSF, for "Bit-Sliced Signature Files." The
figure below illustrates the proposed file structure. Searching for a single word requires the
retrieval of m bit vectors (instead of all F bit vectors), which are subsequently ANDed
together. The resulting bit vector has N bits, with "1"s at the positions of the qualifying logical
blocks. An insertion of a new logical block requires F disk accesses, one per bit-file, but no
rewriting!

File structure for Bit-Sliced Signature Files. The text file is omitted
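The bit-sliced search can be sketched with Python integers standing in for the bit-files (each integer holds one N-bit column of the matrix); this in-memory stand-in ignores the on-disk layout, which is the whole point of the method, but it shows the AND-ing of the m retrieved slices:

```python
def bssf_search(bit_files, word_positions):
    """Fetch only the bit-files named by the query word's m set bit positions
    and AND them together; '1' bits in the result mark qualifying blocks."""
    result = -1  # all ones: every block qualifies until a slice rules it out
    for pos in word_positions:
        result &= bit_files[pos]
    return result
```

Usage: with F = 3 bit-files over N = 4 blocks, a word whose signature sets positions 0 and 1 needs only those two slices, not the third.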

Frame-Sliced Signature File

The idea behind this method is to force each word to hash into bit positions that are close to each
other in the document signature. These bits are then stored together and can be retrieved
with few random disk accesses. The main motivation for this organization is that random disk
accesses are more expensive than sequential ones, since they involve movement of the disk arm.
More specifically, the method works as follows: The document signature (F bits long) is divided
into k frames of s consecutive bits each. For each word in the document, one of the k frames is
chosen by a hash function; using another hash function, the word sets m bits (not necessarily
distinct) in that frame. F, k, s, and m are design parameters.

An example for this method: D = 2 words, F = 12, s = 6, k = 2, m = 3. The word "free" is
hashed into the second frame and sets 3 bits there. The word "text" is hashed into the first
frame and also sets 3 bits there.
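A sketch of frame-sliced signature generation; the two MD5-based hash functions (one to pick the frame, one per bit within the frame) are illustrative assumptions, so the frames chosen for "free" and "text" need not match the example above:

```python
import hashlib

def fssf_signature(words, F=12, k=2, m=3):
    """Frame-sliced signature: hash each word to one of k frames of s = F/k
    consecutive bits, then set m (not necessarily distinct) bits in that frame."""
    s = F // k
    sig = 0
    for w in words:
        # first hash function: choose the frame for this word
        frame = int(hashlib.md5(w.encode()).hexdigest(), 16) % k
        # second hash function: set m bits inside the chosen frame
        for i in range(m):
            bit = int(hashlib.md5(f"{w}|{i}".encode()).hexdigest(), 16) % s
            sig |= 1 << (frame * s + bit)
    return sig
```

Because all of a word's bits land in one frame, a single-word query touches at most one frame, i.e., one short run of consecutive bit-files on disk.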

The Generalized Frame-Sliced Signature File (GFSSF)

In FSSF, each word selects only one frame and sets m bit positions in that frame. A more general
approach is to select n distinct frames and set m bits (not necessarily distinct) in each frame to
generate the word signature. The document signature is the OR-ing of all the word signatures of
all the words in that document. This method is called Generalized Frame-Sliced Signature File
(GFSSF).

Notice that BSSF, B'SSF, FSSF, and SSF are actually special cases of GFSSF:

 When k = F and n = m, it reduces to the BSSF or B'SSF method.
 When n = 1, it reduces to the FSSF method.
 When k = 1 and n = 1, it becomes the SSF method (the document signature is broken down
into one frame only).
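GFSSF signature generation can be sketched by extending the frame-sliced idea with the extra parameter n; the deterministic, hash-based frame selection used here is an illustrative assumption:

```python
import hashlib

def gfssf_signature(words, F=12, k=2, n=1, m=3):
    """GFSSF sketch: each word selects n distinct frames (out of k frames of
    s = F/k bits) and sets m not-necessarily-distinct bits in each of them.
    n = 1 gives FSSF; k = F, n = m gives BSSF; k = n = 1 gives SSF."""
    s = F // k
    sig = 0
    for w in words:
        # rank the k frames by a per-word hash and take the first n (distinct)
        order = sorted(range(k),
                       key=lambda f: hashlib.md5(f"{w}#{f}".encode()).hexdigest())
        for frame in order[:n]:
            for i in range(m):
                bit = int(hashlib.md5(f"{w}|{frame}|{i}".encode()).hexdigest(),
                          16) % s
                sig |= 1 << (frame * s + bit)
    return sig
```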

Performance

Since GFSSF is a generalized model, we expect that a careful choice of the parameters will give
a method that is better (whatever the criterion is) than any of its special cases. Analysis in the
above paper gives formulas for the false drop probability and the expected response time for
GFSSF and the rest of the methods. The figure below plots the theoretically expected performance
of GFSSF, BSSF, B'SSF, and FSSF. Notice that GFSSF is faster than BSSF, B'SSF, and FSSF,
which are all its special cases. It is assumed that the transfer time for a page is Ttrans = 1 msec
and the combined seek and latency time is Tseek = 40 msec.

Response time vs. space overhead: a comparison between BSSF, B'SSF, FSSF and GFSSF.
Analytical results on a 2.8Mb database

HORIZONTAL PARTITIONING

The motivation behind all these methods is to avoid the sequential scanning of the signature file
(or its bit-slices), in order to achieve better than O(N) search time. Thus, they group the
signatures into sets, partitioning the signature matrix horizontally. The grouping criterion can be
decided beforehand, in the form of a hashing function h(S), where S is a document signature
(the data independent case). Alternatively, the groups can be determined on the fly, using a
hierarchical structure such as a B-tree (the data dependent case).

Data Independent Case

Gustafson’s method:

The earliest approach was proposed by Gustafson (1971). Suppose that we have records with,
say, six attributes each. For example, records can be documents and attributes can be keywords
describing the document. Consider a hashing function h that hashes a keyword w to a number
h(w) in the range 0-15. The signature of a keyword is a string of 16 bits, all of which are zero
except for the bit at position h(w). The record signature is created by superimposing the
corresponding keyword signatures. If k < 6 bits are set in a record signature, an additional 6 - k
bits are set by some random method.
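A sketch of Gustafson's record signature; the MD5-based keyword hash and the fixed-seed random padding are illustrative assumptions standing in for "some random method":

```python
import hashlib
import random

def gustafson_signature(keywords):
    """16-bit record signature: each keyword sets the single bit h(w) in 0..15;
    if fewer than 6 bits end up set, pad with (seeded pseudo-random) extra bits
    until exactly 6 are set."""
    sig = 0
    for w in keywords:
        pos = int(hashlib.md5(w.encode()).hexdigest(), 16) % 16
        sig |= 1 << pos
    rng = random.Random(0)  # fixed seed keeps the padding reproducible
    while bin(sig).count("1") < 6:
        sig |= 1 << rng.randrange(16)
    return sig
```

Every record signature thus has exactly six "1"s out of 16 positions, which is what makes the combinatorial grouping of Gustafson's scheme possible.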

Although elegant, Gustafson’s method suffers from some practical problems:

 Its performance deteriorates as the file grows.
 If the number of keywords per document is large, then either we must have a huge hash
table or usual queries (involving 3-4 keywords) will touch a large portion of the database.
 Queries other than conjunctive ones are handled with difficulty.

Partitioned signature files

A portion of a document signature is used as a signature key to partition the signature file. For
example, we can choose the first 20 bits of a signature as its key; all signatures with the same
key are grouped into a so-called "module." A query signature then needs to be matched only
against the modules whose keys are compatible with it.

Data Dependent Case


Two-level signature files

Sacks-Davis and his colleagues (1983, 1987) suggested using two levels of signatures. Their
documents are bibliographic records of variable length. The first level of signatures consists of
document signatures that are stored sequentially, as in the SSF method. The second level consists
of "block signatures"; each such signature corresponds to one block (group) of bibliographic
records, and is created by superimposing the signatures of all the words in this block, ignoring
the record boundaries. The second level is stored in a bit-sliced form. Each level has its own
hashing functions that map words to bit positions.

S-tree

Deppisch (1986) proposed a B-tree-like structure to facilitate fast access to the records (which
are signatures) in a signature file. A leaf of an S-tree consists of k "similar" (i.e., with small
Hamming distance) document signatures, along with the document identifiers. The OR-ing of
these k document signatures forms the "key" of an entry in an upper-level node, which serves as
a directory for the leaves. Recursively, we construct directories on lower-level directories until we
reach the root. The S-tree is kept balanced in a manner similar to a B-tree: when a leaf node
overflows, it is split into two groups of "similar" signatures; the father node is changed
appropriately to reflect the new situation. Splits may propagate upward until they reach the root.

The method requires small space overhead; the response time on queries is difficult to
estimate analytically. Insertion requires a few disk accesses (proportional to the height of the
tree at worst), but the append-only property is lost. Another problem is that higher-level nodes
may contain keys that have many "1"s and thus become useless.
