Unit 3
SIGNATURE FILES
INTRODUCTION
Text retrieval methods have attracted much interest recently. There are numerous applications
involving storage and retrieval of textual data:
The main operational characteristics of all the above applications are the following two:
2. Text databases are archival in nature: documents are inserted, but almost never deleted or
updated.
3. Parallel machines
4. Distributed text databases
BASIC CONCEPTS
Signature files typically use superimposed coding to create the signature of a document. A brief
description of the method follows.
For performance reasons, which will be explained later, each document is divided into "logical
blocks," that is, pieces of text that contain a constant number D of distinct, noncommon words.
(To reduce the space overhead, a stoplist of common words is maintained.) Each such word
yields a "word signature," which is a bit pattern of size F with m bits set to "1" and the rest
set to "0". F and m are design parameters. The word signatures are OR'ed together to form the
block signature, and the block signatures are concatenated to form the document signature. The
m bit positions to be set to "1" by each word are decided by hash functions. Searching for a word
is handled by creating the signature of the word and examining each block signature for "1"s in
the bit positions where the signature of the search word has a "1".
Illustration of the superimposed coding method (table of words and their signatures). It is
assumed that each logical block consists of D = 2 words only. The signature size F is 12 bits,
with m = 4 bits per word.
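The superimposed coding steps described above can be sketched in Python as follows. This is only a minimal sketch, not the implementation used in the literature: the hash scheme (SHA-256 of the word plus a counter) and all function names are assumptions made for illustration, while F = 12 and m = 4 are the values from the illustration.

```python
import hashlib

F = 12  # signature size in bits (value from the illustration)
M = 4   # bits set per word (value from the illustration)

def bit_positions(word, f=F, m=M):
    """Pick m distinct positions in [0, f) for a word (hypothetical hash scheme)."""
    positions, counter = [], 0
    while len(positions) < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        pos = int.from_bytes(digest[:4], "big") % f
        if pos not in positions:
            positions.append(pos)
        counter += 1
    return positions

def word_signature(word, f=F, m=M):
    """F-bit pattern (held in an int) with m bits set to 1."""
    sig = 0
    for pos in bit_positions(word, f, m):
        sig |= 1 << pos
    return sig

def block_signature(words, f=F, m=M):
    """OR the word signatures of one logical block together."""
    sig = 0
    for w in words:
        sig |= word_signature(w, f, m)
    return sig

def may_contain(block_sig, word, f=F, m=M):
    """Signature test: every 1 of the query signature must also be 1 in the block."""
    q = word_signature(word, f, m)
    return (block_sig & q) == q

block = block_signature(["free", "text"])
print(may_contain(block, "free"))  # a word in the block always passes the test
```

A word that is not in the block may still pass the test when other words happen to set all of its bit positions; that is exactly a false drop.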
In order to allow searching for parts of words, the following method has been suggested: Each
word is divided into successive, overlapping triplets (e.g., " fr", "fre", "ree", "ee " for the word
"free", with a blank padding each end of the word). Each such triplet is hashed to a bit position
by applying a hashing function on a numerical encoding of the triplet, for example, considering
the triplet as a base-26 number. In the case of a word that has l triplets, with l > m, the word is
allowed to set l (not necessarily distinct) bits. If l < m, the additional bits are set using a
random number generator, initialized with a numerical encoding of the word.
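A rough Python sketch of the triplet scheme follows. Several details are assumptions made for illustration: the word is padded with one blank at each end (which reproduces the four triplets listed for "free"), the encoding uses base 27 rather than base 26 so that the blank gets its own digit, the hash is simply the encoding modulo F, and only lowercase ASCII letters are expected.

```python
import random

def triplets(word):
    """Successive, overlapping triplets of the blank-padded word."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def encode(text):
    """Numerical encoding: blank -> 0, 'a'..'z' -> 1..26, read as a base-27 number."""
    value = 0
    for ch in text:
        digit = 0 if ch == " " else ord(ch) - ord("a") + 1
        value = value * 27 + digit
    return value

def word_signature(word, f, m):
    """Set one bit per triplet; top up with seeded random bits if there are fewer than m."""
    word = word.lower()
    trips = triplets(word)
    sig = 0
    for t in trips:
        sig |= 1 << (encode(t) % f)       # hypothetical hash: encoding mod F
    if len(trips) < m:
        rng = random.Random(encode(word))  # generator initialized with the word's encoding
        for _ in range(m - len(trips)):
            sig |= 1 << rng.randrange(f)
    return sig

print(triplets("free"))
```

Because the random generator is seeded with the encoding of the word, the same word always yields the same signature, which is what makes the extra bits usable at query time.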
An important concept in signature files is the false drop probability Fd. Intuitively, it gives the
probability that the signature test will fail, creating a "false alarm" (or "false hit" or "false drop").
Notice that the signature test never gives a false dismissal.
False drop probability, Fd, is the probability that a block signature seems to qualify, given that
the block does not actually qualify. Expressed mathematically:
$F_d = \mathrm{Prob}\{\text{signature qualifies} \mid \text{block does not}\}$
The signature file is an F x N binary matrix. Previous analysis showed that, for a given value of F,
the value of m that minimizes the false drop probability is such that each row of the matrix
contains "1"s with probability 50 percent. Under such an optimal design, we have
$F_d = 2^{-m}$
$F \ln 2 = m D$
This is the reason that documents have to be divided into logical blocks: Without logical blocks,
a long document would have a signature full of "1"s, and it would always create a false drop.
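The two design formulas can be checked numerically with a short Python sketch. The values F = 1000 and D = 40 are hypothetical, chosen only for illustration, and the last function assumes that each word sets its m bits independently and uniformly at random.

```python
import math

def optimal_m(F, D):
    """Bits per word satisfying F ln 2 = m D (the result may be fractional)."""
    return F * math.log(2) / D

def false_drop_probability(m):
    """Fd = 2^(-m) under the optimal design."""
    return 2.0 ** (-m)

def expected_ones_fraction(F, D, m):
    """Probability that a given bit is 1 after D words each set m random bits."""
    return 1.0 - (1.0 - 1.0 / F) ** (m * D)

m = optimal_m(1000, 40)  # hypothetical F and D
print(round(m, 2), false_drop_probability(m), expected_ones_fraction(1000, 40, m))
```

With these choices roughly half of the bits of a block signature are set, in line with the 50 percent condition. Increasing D while keeping F and m fixed pushes that fraction toward 1, which is the "signature full of 1s" problem that logical blocks avoid.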
Although the sequential signature file (SSF) method has been used as is, it may be slow for large
databases. Many methods have been suggested that try to improve the response time of SSF,
trading off space or insertion simplicity for speed. The main ideas behind all these methods are
the following:
2. Vertical partitioning: Storing the signature matrix column-wise improves the response time at
the expense of insertion time.
3. Horizontal partitioning: Grouping similar signatures together and/or providing an index on the
signature matrix may result in better-than-linear search.
COMPRESSION
In this approach we create sparse document signatures on purpose, and then compress them before
storing them sequentially.
The idea is to use a (large) bit vector of B bits and to hash each word into one (or
perhaps more, say n) bit position(s), which are set to "1".
The resulting bit vector will be sparse and can therefore be compressed.
Illustration of the compression-based methods. With B = 20 and n = 1 bit per word, the
resulting bit vector is sparse and can be compressed.
Compression using run-length encoding. The notation [x] stands for the encoded value of
number x.
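A minimal Python sketch of the idea is given below. The hash function (SHA-256 modulo B) and the function names are assumptions, and the run lengths are left as plain integers because the text does not specify the code [x] used to store them.

```python
import hashlib

def sparse_positions(words, B, n=1):
    """Hash each word into n positions of a B-bit vector; return the sorted 1-positions."""
    positions = set()
    for w in words:
        for i in range(n):
            digest = hashlib.sha256(f"{w}:{i}".encode()).digest()
            positions.add(int.from_bytes(digest[:4], "big") % B)
    return sorted(positions)

def run_lengths(positions):
    """Length of the run of 0s that precedes each 1."""
    runs, prev = [], -1
    for p in positions:
        runs.append(p - prev - 1)
        prev = p
    return runs

def decode(runs):
    """Recover the 1-positions from the run lengths."""
    positions, pos = [], -1
    for r in runs:
        pos += r + 1
        positions.append(pos)
    return positions

ones = sparse_positions(["data", "base", "management", "system"], B=20)
print(ones, run_lengths(ones))
```

Each run length would then be written with a variable-length code, so that the short integers typical of a sparse vector take few bits.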
For each bit-block we create a signature, which is of variable length and consists of at most three
parts:
Part I: It is one bit long and it indicates whether there are any "1"s in the bit-block
(1) or the bit-block is empty (0). In the latter case, the bit-block signature stops here.
Part II: It indicates the number s of "1"s in the bit-block. It consists of s - 1 "1"s and a
terminating zero. This is not the optimal way to record the number of "1"s. However, this
representation is simple and it seems to give results close to the optimal.
Part III: It contains the offsets of the "1"s from the beginning of the bit-block (lg b bits for each
"1", where b is the size of the bit-block).
The figure below illustrates an example layout of the signatures in the VBC method. The upper row
corresponds to a small message with small W, while the lower row corresponds to a message with
large W. Thus, the upper row has a larger value of bopt, fewer bit-blocks, a shorter Part I (the
size of Part I is the number of bit-blocks), a shorter Part II (its size is W), and fewer but larger
offsets in Part III (the size of each offset is log bopt bits).
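The three parts of a bit-block signature can be illustrated with a small Python sketch. It encodes a single bit-block, given as a string of "0" and "1" characters; the function name is hypothetical, and the offset width is taken as the number of bits needed to address b positions (lg b rounded up), which is an assumption for block sizes that are not powers of two.

```python
def encode_bit_block(block):
    """Return (part1, part2, part3) for one bit-block given as a '0'/'1' string."""
    offsets = [i for i, bit in enumerate(block) if bit == "1"]
    if not offsets:
        return "0", "", ""                      # empty bit-block: the signature stops here
    s = len(offsets)
    width = max(1, (len(block) - 1).bit_length())  # bits per offset, lg b rounded up
    part2 = "1" * (s - 1) + "0"                  # s - 1 ones and a terminating zero
    part3 = "".join(format(o, f"0{width}b") for o in offsets)
    return "1", part2, part3

print(encode_bit_block("0000"))
print(encode_bit_block("0010"))
print(encode_bit_block("1001"))
```

In the layout described above, the Part I bits of all bit-blocks of a message are stored together, followed by the Part II and Part III portions; the sketch shows only how each bit-block contributes to the three parts.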
Comparison of Fd of BC (dotted line) against SSF (solid line), as a function of the space
overhead Ov. Analytical results, from Faloutsos and Christodoulakis (1987)
VERTICAL PARTITIONING
The idea behind vertical partitioning is to avoid bringing useless portions of the document
signature into main memory; this can be achieved by storing the signature file in a bit-sliced
form or in a “frame-sliced” form.
To allow insertions, we propose using F different files, one per bit position, which will be
referred to as "bit-files." The method will be called BSSF, for "Bit-Sliced Signature Files." The
accompanying figure illustrates the proposed file structure. Searching for a single word requires
the retrieval of m bit vectors (instead of all of the F bit vectors), which are subsequently ANDed
together. The resulting bit vector has N bits, with "1"s at the positions of the qualifying logical
blocks. An insertion of a new logical block requires F disk accesses, one for each bit-file, but no
rewriting!
File structure for Bit-Sliced Signature Files. The text file is omitted
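The bit-sliced organization can be modeled in memory with a short Python sketch. Here each "bit-file" is represented by a Python integer whose i-th bit belongs to logical block i; the class and method names are hypothetical, and real bit-files would live on disk and be appended to rather than held as integers.

```python
class BitSlicedFile:
    """In-memory model of BSSF: one bit vector per signature bit position."""

    def __init__(self, F):
        self.F = F
        self.n_blocks = 0
        self.slices = [0] * F  # slices[j] plays the role of bit-file j

    def insert(self, one_positions):
        """Append a block signature, given as the positions of its 1 bits."""
        for j in one_positions:
            self.slices[j] |= 1 << self.n_blocks  # touch bit-file j only at the new end
        self.n_blocks += 1

    def search(self, query_positions):
        """AND the bit-files of the query's 1 positions; return qualifying block numbers."""
        result = (1 << self.n_blocks) - 1
        for j in query_positions:
            result &= self.slices[j]
        return [i for i in range(self.n_blocks) if (result >> i) & 1]

bssf = BitSlicedFile(12)
bssf.insert({0, 3, 5, 7})
bssf.insert({1, 3, 5, 9})
bssf.insert({0, 1, 3, 5, 7, 9})
print(bssf.search([0, 3]))
```

Only the bit-files named by the query are read, which is the source of the saving over scanning all F columns.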
The idea behind the frame-sliced signature file (FSSF) method is to force each word to hash into
bit positions that are close to each other in the document signature. Then, the corresponding bit
files are stored together and can be retrieved with few random disk accesses. The main motivation
for this organization is that random disk accesses are more expensive than sequential ones, since
they involve movement of the disk arm.
More specifically, the method works as follows: The document signature (F bits long) is divided
into k frames of s consecutive bits each. For each word in the document, one of the k frames will
be chosen by a hash function; using another hash function, the word sets m bits (not necessarily
distinct) in that frame. F, k, s, m are design parameters.
D = 2 words. F = 12, s = 6, k = 2, m = 3. The word "free" is hashed into the second frame and
sets 3 bits there. The word "text" is hashed into the first frame and also sets 3 bits there.
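A word signature under the frame-sliced scheme can be sketched in Python as follows. The two hash functions are stand-ins built on SHA-256 (an assumption; the text does not name them), and the function names are hypothetical. Frame number 0 corresponds to the first frame.

```python
import hashlib

def _h(text, modulus):
    """Hypothetical hash: SHA-256 of the text, reduced modulo the given range."""
    digest = hashlib.sha256(text.encode()).digest()
    return int.from_bytes(digest[:4], "big") % modulus

def fssf_word_signature(word, k, s, m):
    """Choose one of k frames, then set m (not necessarily distinct) bits inside it."""
    frame = _h(f"frame:{word}", k)             # first hash function: which frame
    sig = 0
    for i in range(m):
        offset = _h(f"bit{i}:{word}", s)       # second hash function: bit within the frame
        sig |= 1 << (frame * s + offset)
    return frame, sig

frame, sig = fssf_word_signature("free", k=2, s=6, m=3)
print(frame, format(sig, "012b"))
```

At query time only the frame selected by the search word has to be read, so one seek followed by a sequential transfer replaces up to m random accesses.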
In FSSF, each word selects only one frame and sets m bit positions in that frame. A more general
approach is to select n distinct frames and set m bits (not necessarily distinct) in each frame to
generate the word signature. The document signature is the OR-ing of all the word signatures of
all the words in that document. This method is called Generalized Frame-Sliced Signature File
(GFSSF).
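The generalization can be sketched in Python in the same spirit. As before, the SHA-256-based hash and the function names are assumptions made for illustration; the sketch only shows how n distinct frames are selected and how m bits are set in each of them.

```python
import hashlib

def _h(text, modulus):
    """Hypothetical hash: SHA-256 of the text, reduced modulo the given range."""
    digest = hashlib.sha256(text.encode()).digest()
    return int.from_bytes(digest[:4], "big") % modulus

def gfssf_word_signature(word, k, s, n, m):
    """Select n distinct frames out of k and set m (possibly repeated) bits in each."""
    if n > k:
        raise ValueError("n distinct frames require n <= k")
    frames, counter = [], 0
    while len(frames) < n:                     # rehash until n distinct frames are found
        frame = _h(f"frame{counter}:{word}", k)
        if frame not in frames:
            frames.append(frame)
        counter += 1
    sig = 0
    for frame in frames:
        for i in range(m):
            sig |= 1 << (frame * s + _h(f"bit{frame}.{i}:{word}", s))
    return frames, sig

frames, sig = gfssf_word_signature("free", k=4, s=8, n=2, m=3)
print(frames, format(sig, "032b"))
```

Calling the function with n = 1 confines all of a word's bits to a single frame, which is the FSSF behavior described earlier.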
Notice that BSSF, B'SSF, FSSF, and SSF are actually special cases of GFSSF:
Performance
Since GFSSF is a generalized model, we expect that a careful choice of the parameters will give
a method that is better (whatever the criterion is) than any of its special cases. Analysis in the
above paper gives formulas for the false drop probability and the expected response time for
GFSSF and the rest of the methods. The figure below plots the theoretically expected performance
of GFSSF, BSSF, B'SSF, and FSSF. Notice that GFSSF is faster than BSSF, B'SSF, and FSSF,
which are all its special cases. It is assumed that the transfer time for a page is Ttrans = 1 msec
and the combined seek and latency time is Tseek = 40 msec.
Response time vs. space overhead: a comparison between BSSF, B'SSF, FSSF and GFSSF.
Analytical results on a 2.8Mb database
HORIZONTAL PARTITIONING
The motivation behind all these methods is to avoid the sequential scanning of the signature file
(or its bit-slices), in order to achieve better than O(N) search time. Thus, they group the
signatures into sets, partitioning the signature matrix horizontally. The grouping criterion can be
decided beforehand, in the form of a hashing function h(S), where S is a document signature
(the data-independent case). Alternatively, the groups can be determined on the fly, using a
hierarchical structure such as a B-tree (the data-dependent case).
Gustafson’s method:
The earliest approach was proposed by Gustafson (1971). Suppose that we have records with,
say, six attributes each. For example, records can be documents and attributes can be keywords
describing the document. Consider a hashing function h that hashes a keyword w to a number
h(w) in the range 0-15. The signature of a keyword is a string of 16 bits, all of which are zero
except for the bit at position h(w). The record signature is created by superimposing the
corresponding keyword signatures. If k < 6 bits are set in a record signature, an additional 6 - k
bits are set by some random method.
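The construction of a record signature described above can be sketched in Python. The keyword hash (SHA-256 modulo 16) and the way the extra bits are chosen (a random generator seeded with the partial signature) are assumptions; the text only says that "some random method" is used.

```python
import hashlib
import random

def keyword_bit(keyword, width=16):
    """Hypothetical h(w): map a keyword to a bit position in 0..width-1."""
    digest = hashlib.sha256(keyword.encode()).digest()
    return int.from_bytes(digest[:4], "big") % width

def record_signature(keywords, width=16, weight=6):
    """Superimpose one-bit keyword signatures, then pad to exactly `weight` bits."""
    sig = 0
    for w in keywords:
        sig |= 1 << keyword_bit(w, width)
    k = bin(sig).count("1")
    if k < weight:
        rng = random.Random(sig)  # assumed random method, seeded for reproducibility
        unset = [p for p in range(width) if not (sig >> p) & 1]
        for p in rng.sample(unset, weight - k):
            sig |= 1 << p
    return sig

sig = record_signature(["signature", "file", "text"])
print(format(sig, "016b"))
```

With at most six keywords per record, every record signature ends up with exactly six of its 16 bits set.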
A portion of a document signature can serve as a signature key to partition the signature file.
For example, we can choose the first 20 bits of a signature as its key, and all signatures with the
same key will be grouped into a so-called "module."
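Grouping by signature key can be sketched in Python as follows. The names are hypothetical and the signatures are small hand-picked integers. The module-skipping rule in the search function is not stated in the text; it is included as an assumption that follows from superimposed coding, since a signature can only qualify if its key contains every "1" of the query's key.

```python
from collections import defaultdict

def signature_key(sig, F, key_bits):
    """The first key_bits bits of an F-bit signature (taken as the most significant bits)."""
    return sig >> (F - key_bits)

def build_modules(signatures, F, key_bits):
    """Group (doc_id, signature) pairs into modules that share the same key."""
    modules = defaultdict(list)
    for doc_id, sig in signatures.items():
        modules[signature_key(sig, F, key_bits)].append((doc_id, sig))
    return modules

def search(modules, query_sig, F, key_bits):
    """Scan only the modules whose key covers the key portion of the query."""
    qkey = signature_key(query_sig, F, key_bits)
    hits = []
    for key, members in modules.items():
        if (key & qkey) != qkey:
            continue  # no signature in this module can qualify
        hits.extend(doc for doc, sig in members if (sig & query_sig) == query_sig)
    return hits

sigs = {"d1": 0b10110001, "d2": 0b10111000, "d3": 0b01000011}
modules = build_modules(sigs, F=8, key_bits=4)
print(search(modules, 0b10000001, F=8, key_bits=4))
```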
Sacks-Davis and his colleagues (1983, 1987) suggested using two levels of signatures. Their
documents are bibliographic records of variable length. The first level of signatures consists of
document signatures that are stored sequentially, as in the SSF method. The second level consists
of “block signatures”; each such signature corresponds to one block (group) of bibliographic
records, and is created by superimposing the signatures of all the words in this block, ignoring
the record boundaries. The second level is stored in a bit-sliced form. Each level has its own
hashing functions that map words to bit positions.
S-tree
Deppisch (1986) proposed a B-tree-like structure to facilitate fast access to the records (which
are signatures) in a signature file. A leaf of an S-tree consists of k "similar" (i.e., with small
Hamming distance) document signatures along with the document identifiers. The OR-ing of
these k document signatures forms the "key" of an entry in an upper-level node, which serves as
a directory for the leaves. Recursively, we construct directories on lower-level directories until
we reach the root. The S-tree is kept balanced in a similar manner to a B-tree: when a leaf node
overflows, it is split into two groups of "similar" signatures, and the parent node is changed
appropriately to reflect the new situation. Splits may propagate upward until reaching the root.
The method requires a small space overhead; the response time on queries is difficult to
estimate analytically. An insertion requires a few disk accesses (proportional to the height of the
tree at worst), but the append-only property is lost. Another problem is that higher-level nodes
may contain keys that have many "1"s and thus become useless.
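Searching an S-tree can be sketched in Python as below. The node layout and names are hypothetical, and splitting and balancing are left out; the sketch only shows how the OR-ed keys let a query skip whole subtrees, and why a key with many "1"s prunes nothing.

```python
class Node:
    """Leaf entries are (signature, doc_id); inner entries are (key, child_node)."""

    def __init__(self, entries, leaf):
        self.entries = entries
        self.leaf = leaf

def node_key(node):
    """Key stored in the parent: the OR of all signatures (or keys) in the node."""
    key = 0
    for sig, _ in node.entries:
        key |= sig
    return key

def search(node, query_sig):
    """Descend only into entries whose key contains every 1 of the query signature."""
    hits = []
    for sig, item in node.entries:
        if (sig & query_sig) != query_sig:
            continue  # nothing below this entry can qualify
        if node.leaf:
            hits.append(item)
        else:
            hits.extend(search(item, query_sig))
    return hits

leaf1 = Node([(0b0011, "a"), (0b0111, "b")], leaf=True)
leaf2 = Node([(0b1100, "c"), (0b1000, "d")], leaf=True)
root = Node([(node_key(leaf1), leaf1), (node_key(leaf2), leaf2)], leaf=False)
print(search(root, 0b0001))
```

If an upper-level key is all "1"s, the test in the loop succeeds for every query and the subtree is always visited, which is the degradation noted above.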