Unit 3

Uploaded by

poorna649

UNIT 3

SIGNATURE FILES

INTRODUCTION

Text retrieval methods have attracted much interest recently. There are numerous applications
involving storage and retrieval of textual data:

• Electronic office filing.
• Computerized libraries.
• Library of Medicine.
• Electronic storage and retrieval of articles from newspapers and magazines.
• Electronic encyclopedias.
• Indexing of software components to enhance reusability.
• Searching databases with descriptions of DNA molecules.

The main operational characteristics of all the above applications are the following two:

1. Text databases are traditionally large.

2. Text databases are archival in nature: there are insertions into them, but almost never
deletions or updates.

A brief, qualitative comparison of the signature-based methods versus their competitors is as
follows: The signature-based methods are much faster than full text scanning. Compared to
inversion, they require a modest space overhead; moreover, they can handle insertions more
easily than inversion, because they need only "append-only" operations, with no reorganization
or rewriting of any portion of the signatures. On the other hand, signature files may be slow for
large databases, precisely because their response time is linear in the number of items N in the
database. Thus, signature files have been used in the following environments:

1. PC-based, medium-size databases

2. Write-Once-Read-Many (WORM) storage

3. Parallel machines

4. Distributed text databases

BASIC CONCEPTS

Signature files typically use superimposed coding to create the signature of a document. A brief
description of the method follows.

For performance reasons, which will be explained later, each document is divided into "logical
blocks," that is, pieces of text that contain a constant number D of distinct, noncommon words.
(To improve the space overhead, a stoplist of common words is maintained.) Each such word
yields a "word signature," which is a bit pattern of size F, with m bits set to "1" and the rest
set to "0". F and m are design parameters. The word signatures are OR'ed together to form the
block signature. Block signatures are concatenated to form the document signature. The m bit
positions to be set to "1" by each word are decided by hash functions. Searching for a word is
handled by creating the signature of the word and examining each block signature for "1"s in
those bit positions where the signature of the search word has a "1".

Word               Signature
----------------------------------
free               001 000 110 010
text               000 010 101 001
----------------------------------
block signature    001 010 111 011

Illustration of the superimposed coding method. It is assumed that each logical block
consists of D = 2 words only. The signature size F is 12 bits, with m = 4 bits per word.
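
The construction and the search test can be sketched in a few lines of Python. This is only a
minimal illustration: the hash function (SHA-256 with a counter) and the names word_signature,
block_signature and may_contain are assumptions made for the example, not part of the method
as described, so the bit patterns it produces will differ from those in the table.

```python
import hashlib

def word_signature(word: str, F: int = 12, m: int = 4) -> int:
    """Return an F-bit word signature (as an int) with exactly m bits set.

    Assumption: bit positions come from SHA-256 of the word plus a counter;
    any family of hash functions would do. Requires m <= F.
    """
    sig, counter = 0, 0
    while bin(sig).count("1") < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % F)
        counter += 1
    return sig

def block_signature(words, F: int = 12, m: int = 4) -> int:
    """OR the word signatures of one logical block together."""
    sig = 0
    for w in words:
        sig |= word_signature(w, F, m)
    return sig

def may_contain(block_sig: int, word: str, F: int = 12, m: int = 4) -> bool:
    """Signature test: every 1 of the query signature must be 1 in the block."""
    q = word_signature(word, F, m)
    return (block_sig & q) == q

# The table above, checked directly: free OR text == block signature.
assert 0b001000110010 | 0b000010101001 == 0b001010111011

block = block_signature(["free", "text"])
print(may_contain(block, "free"))   # True: a stored word always passes the test
```

A word that is in the block always passes the test; a word that is not in the block may still
pass it, which is the false drop discussed below.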

In order to allow searching for parts of words, the following method has been suggested: Each
word is divided into successive, overlapping triplets (e.g., " fr", "fre", "ree", "ee " for the
word "free", where the first and last triplets include a word-boundary blank). Each such triplet
is hashed to a bit position by applying a hashing function on a numerical encoding of the
triplet, for example, considering the triplet as a base-26 number. In the case of a word that has
l triplets, with l > m, the word is allowed to set l (not necessarily distinct) bits. If l < m,
the additional bits are set using a random number generator, initialized with a numerical
encoding of the word.
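
A minimal sketch of the triplet scheme follows. Several details are assumptions made for the
example: words are taken to be lowercase a-z, a blank is added on each side of the word, the
triplet is encoded in base 27 (26 letters plus the blank) rather than base 26, and the hash is a
simple modulo F.

```python
import random

def triplets(word: str) -> list[str]:
    """Successive overlapping triplets of the word, padded with one blank per side."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def triplet_position(triplet: str, F: int) -> int:
    """Hash a triplet to a bit position via a base-27 encoding (blank = 0, a = 1, ...)."""
    value = 0
    for ch in triplet.lower():
        digit = 0 if ch == " " else ord(ch) - ord("a") + 1
        value = value * 27 + digit
    return value % F

def word_signature(word: str, F: int = 12, m: int = 4) -> int:
    """Set one (not necessarily distinct) bit per triplet; pad up to m if needed."""
    trips = triplets(word)
    sig = 0
    for t in trips:
        sig |= 1 << triplet_position(t, F)
    if len(trips) < m:
        # Extra bits come from a generator seeded with an encoding of the word,
        # so the same word always yields the same signature.
        rng = random.Random(int.from_bytes(word.encode(), "big"))
        for _ in range(m - len(trips)):
            sig |= 1 << rng.randrange(F)
    return sig

print(triplets("free"))   # [' fr', 'fre', 'ree', 'ee ']
```

Because every triplet of a word contributes a bit, a query on a word part such as "ree" only
needs the bit positions of its own triplets, all of which are set in the signature of "free".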

An important concept in signature files is the false drop probability Fd. Intuitively, it gives the
probability that the signature test will fail, creating a "false alarm" (or "false hit" or "false drop").
Notice that the signature test never gives a false dismissal.

False drop probability

False drop probability, Fd, is the probability that a block signature seems to qualify, given that
the block does not actually qualify. Expressed mathematically:

F_d = Prob{signature qualifies | block does not}

The signature file is an F x N binary matrix. Previous analysis showed that, for a given value of
F, the value of m that minimizes the false drop probability is such that each row of the matrix
contains "1"s with probability 50 percent. Under such an optimal design, we have

F_d = 2^{-m}

F \ln 2 = m D

This is the reason that documents have to be divided into logical blocks: Without logical blocks,
a long document would have a signature full of "1"s, and it would always create a false drop.
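
As a quick numerical check of the two optimal-design relations, the sketch below plugs in the
parameters of the earlier illustration (F = 12, D = 2). The function names are chosen for the
example only.

```python
from math import log

def optimal_m(F: int, D: int) -> float:
    """Solve F ln 2 = m D for m."""
    return F * log(2) / D

def false_drop(m: int) -> float:
    """False drop probability under the optimal design, Fd = 2^(-m)."""
    return 2.0 ** (-m)

m = optimal_m(F=12, D=2)
print(round(m, 2))            # about 4.16, so m = 4 bits per word is a sensible choice
print(false_drop(round(m)))   # 0.0625, i.e. one false drop in 16 non-qualifying blocks
```

Keeping Fd fixed while D grows requires F to grow in proportion, which is why the number of words
per logical block has to be bounded.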

Sequential Signature File

Although the sequential signature file (SSF) has been used as is, it may be slow for large
databases. Many methods have been suggested that try to improve the response time of SSF,
trading off space or insertion simplicity for speed. The main ideas behind all these methods are
the following:

1. Compression: If the signature matrix is deliberately sparse, it can be compressed.

2. Vertical partitioning: Storing the signature matrix column-wise improves the response time at
the expense of insertion time.

3. Horizontal partitioning: Grouping similar signatures together and/or providing an index on the
signature matrix may result in better-than-linear search.

File Structure for SSF

COMPRESSION

• In this approach we create sparse document signatures on purpose, and then compress them
before storing them sequentially.
• The concept is to use a (large) bit vector of B bits and to hash each word into one (or
perhaps more, say n) bit position(s), which are set to "1".
• The resulting bit vector will be sparse and therefore it can be compressed.

Illustration of the compression-based methods. With B = 20 and n = 1 bit per word, the
resulting bit vector is sparse and can be compressed.

Compression using run-length encoding. The notation [x] stands for the encoded value of
number x
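
The sketch below builds such a sparse vector and reduces it to the lengths of its runs of
zeros. It stops there: the actual code used for the encoded value [x] is not specified in these
notes, so no particular bit-level encoding is assumed. The hash function and function names are
likewise assumptions made for the example.

```python
import hashlib

def sparse_vector(words, B: int = 20, n: int = 1) -> list[int]:
    """Hash each word into n positions of a B-bit vector (SHA-256 is an assumption)."""
    bits = [0] * B
    for w in words:
        for i in range(n):
            digest = hashlib.sha256(f"{w}:{i}".encode()).digest()
            bits[int.from_bytes(digest[:4], "big") % B] = 1
    return bits

def run_lengths(bits: list[int]) -> list[int]:
    """Number of 0s before each 1, followed by the number of trailing 0s."""
    runs, current = [], 0
    for bit in bits:
        if bit:
            runs.append(current)
            current = 0
        else:
            current += 1
    runs.append(current)
    return runs

def from_run_lengths(runs: list[int]) -> list[int]:
    """Inverse of run_lengths: rebuild the bit vector."""
    bits = []
    for r in runs[:-1]:
        bits.extend([0] * r + [1])
    bits.extend([0] * runs[-1])
    return bits

print(run_lengths([0, 0, 1, 0, 0, 0, 1, 0]))   # [2, 3, 1]
```

Each run length would then be written with a variable-length code, so that the long runs of a
sparse vector cost only a few bits each.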

Bit-block Compression (BC)

This method accelerates the search by sacrificing some space, compared to the run-length
encoding technique. The compression method is based on bit-blocks, and was called BC (for
Bit-block Compression). To speed up the searching, the sparse vector is divided into groups of
consecutive bits (bit-blocks); each bit-block is encoded individually.

For each bit-block we create a signature, which is of variable length and consists of at most three
parts:

Part I: It is one bit long and it indicates whether there are any "1"s in the bit-block (1) or
the bit-block is empty (0). In the latter case, the bit-block signature stops here.

Part II: It indicates the number s of "1"s in the bit-block. It consists of s - 1 "1"s and a
terminating zero. This is not the optimal way to record the number of "1"s. However, this
representation is simple and it seems to give results close to the optimal.

Part III: It contains the offsets of the "1"s from the beginning of the bit-block (log2 b bits
for each "1", where b is the bit-block size).

Illustration of the BC method with bit-block size b = 4.
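
The three parts can be produced mechanically. The following is a minimal sketch, assuming that b
is a power of two and that the parts of all bit-blocks are collected into three separate strings;
the input vector is an illustrative one chosen for the example, and the function name bc_encode
is an assumption.

```python
def bc_encode(bits: str, b: int) -> tuple[str, str, str]:
    """Encode a sparse bit string with bit-block size b (assumed a power of two).

    Returns (Part I, Part II, Part III); the entries of Parts II and III are
    separated by blanks only to make the output readable.
    """
    offset_width = (b - 1).bit_length()   # log2(b) bits per offset
    part1, part2, part3 = [], [], []
    for start in range(0, len(bits), b):
        block = bits[start:start + b]
        ones = [i for i, bit in enumerate(block) if bit == "1"]
        if not ones:
            part1.append("0")             # empty bit-block: its signature stops here
            continue
        part1.append("1")
        part2.append("1" * (len(ones) - 1) + "0")   # s - 1 ones, then a zero
        part3.extend(format(i, f"0{offset_width}b") for i in ones)
    return "".join(part1), " ".join(part2), " ".join(part3)

print(bc_encode("00001001000000101000", b=4))
# ('01011', '10 0 0', '00 11 10 00')
```

The five bit-blocks 0000 1001 0000 0010 1000 give one Part I bit each; only the three non-empty
blocks contribute to Parts II and III.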

Variable Bit-block Compression

The BC method was slightly modified to become insensitive to changes in the number of words
D per block. As a result, messages need not be split into logical blocks, so there is no need to
"remember" whether some of the terms of the query have appeared in one of the previous logical
blocks of the message under inspection. The idea is to use a different value for the bit-block
size b_opt for each message, according to the number W of bits set to 1 in the sparse vector. The
size of the sparse vector B is the same for all messages.

The figure below illustrates an example layout of the signatures in the VBC method. The upper row
corresponds to a small message with small W, while the lower row corresponds to a message with
large W. Thus, the upper row has a larger value of b_opt, fewer bit-blocks, a shorter Part I (the
size of Part I is the number of bit-blocks), a shorter Part II (its size is W) and fewer but
larger offsets in Part III (the size of each offset is log2 b_opt bits).

An example layout of the message signatures in the VBC method

Performance

With respect to space overhead, the two methods (BC and VBC) require less space than SSF for
the same false drop probability. Their response time is slightly lower than that of SSF, due to
the decreased I/O requirements. The required main-memory operations are more complicated
(decompression, etc.), but they are probably not the bottleneck. VBC achieves significant savings
even on main-memory operations. With respect to insertions, the two methods are almost as
easy as SSF; they require a few additional CPU cycles to do the compression.

Comparison of Fd of BC (dotted line) against SSF (solid line), as a function of the space
overhead Ov. Analytical results, from Faloutsos and Christodoulakis (1987)

VERTICAL PARTITIONING

The idea behind vertical partitioning is to avoid bringing useless portions of the document
signature into main memory; this can be achieved by storing the signature file in a bit-sliced
form or in a "frame-sliced" form.

Bit-Sliced Signature Files (BSSF)

The bit-sliced design is illustrated in the figure below.

Transposed bit matrix

To allow insertions, we propose using F different files, one for each bit position, which will be
referred to as "bit-files." The method will be called BSSF, for "Bit-Sliced Signature Files." The
figure below illustrates the proposed file structure. Searching for a single word requires the
retrieval of m bit vectors (instead of all of the F bit vectors), which are subsequently ANDed
together. The resulting bit vector has N bits, with "1"s at the positions of the qualifying logical
blocks. An insertion of a new logical block requires F disk accesses, one for each bit-file, but no
rewriting.

File structure for Bit-Sliced Signature Files. The text file is omitted
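
A minimal in-memory sketch of the bit-sliced organization is given below. Each of the F
"bit-files" is modelled as a Python integer used as a growable bit vector; the class name and
methods are assumptions made for the example, and a real implementation would append to F files
on disk instead.

```python
class BitSlicedFile:
    """F bit slices; bit i of slice j is bit j of the signature of block i."""

    def __init__(self, F: int):
        self.F = F
        self.slices = [0] * F
        self.n_blocks = 0

    def append(self, signature: int) -> None:
        """Insert a new block signature: touch every slice once, rewrite nothing."""
        for j in range(self.F):
            if (signature >> j) & 1:
                self.slices[j] |= 1 << self.n_blocks
        self.n_blocks += 1

    def search(self, query_sig: int) -> list[int]:
        """AND only the slices where the query has a 1; return qualifying block numbers."""
        result = (1 << self.n_blocks) - 1
        for j in range(self.F):
            if (query_sig >> j) & 1:
                result &= self.slices[j]
        return [i for i in range(self.n_blocks) if (result >> i) & 1]

f = BitSlicedFile(F=12)
f.append(0b001010111011)            # block 0: contains "free" and "text"
f.append(0b000010101001)            # block 1: contains "text" only
print(f.search(0b001000110010))     # [0]     query "free"
print(f.search(0b000010101001))     # [0, 1]  query "text"
```

The search reads only the m slices selected by the query signature, which is where the saving
over a row-wise scan comes from.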

Frame-Sliced Signature File

The idea behind this method is to force each word to hash into bit positions that are close to
each other in the document signature. Then, these bit-files are stored together and can be
retrieved with few random disk accesses. The main motivation for this organization is that random
disk accesses are more expensive than sequential ones, since they involve movement of the disk
arm. More specifically, the method works as follows: The document signature (F bits long) is
divided into k frames of s consecutive bits each. For each word in the document, one of the k
frames is chosen by a hash function; using another hash function, the word sets m bits (not
necessarily distinct) in that frame. F, k, s, m are design parameters.

An example for this method:

D = 2 words. F = 12, s = 6, k = 2, m = 3. The word free is hashed into the second frame and
sets 3 bits there. The word text is hashed into the first frame and also sets 3 bits there
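
A minimal sketch of the frame-sliced word signature, using the parameters of the example
(k = 2 frames of s = 6 bits, m = 3). The two hash functions are derived from SHA-256 with
different salts, which is an assumption of the sketch; the frame a given word lands in will
therefore not necessarily match the example.

```python
import hashlib

def _h(word: str, salt: str) -> int:
    """Illustrative hash: SHA-256 of salt and word, truncated to 32 bits."""
    digest = hashlib.sha256(f"{salt}:{word}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def fssf_word_signature(word: str, k: int = 2, s: int = 6, m: int = 3) -> int:
    """Pick one of k frames, then set m (not necessarily distinct) bits inside it."""
    frame = _h(word, "frame") % k
    sig = 0
    for i in range(m):
        offset = _h(word, f"bit{i}") % s
        sig |= 1 << (frame * s + offset)
    return sig

def frames_touched(sig: int, k: int = 2, s: int = 6) -> list[int]:
    """Frames that contain at least one 1 in the signature."""
    mask = (1 << s) - 1
    return [f for f in range(k) if (sig >> (f * s)) & mask]

sig = fssf_word_signature("free")
print(format(sig, "012b"), frames_touched(sig))
```

A single-word query then needs only the s bit-files of one frame, which can be stored
contiguously.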

The Generalized Frame-Sliced Signature File (GFSSF)

In FSSF, each word selects only one frame and sets m bit positions in that frame. A more general
approach is to select n distinct frames and set m bits (not necessarily distinct) in each frame to
generate the word signature. The document signature is the OR-ing of all the word signatures of
all the words in that document. This method is called Generalized Frame-Sliced Signature File
(GFSSF).

Notice that BSSF, B'SSF, FSSF, and SSF are actually special cases of GFSSF:

• When k = F, n = m, it reduces to the BSSF or B'SSF method.
• When n = 1, it reduces to the FSSF method.
• When k = 1, n = 1, it becomes the SSF method (the document signature is broken down
to one frame only).
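
The way these special cases fall out of the parameters can be seen in a short sketch. The frame
selection rule (rank the k frames by a per-word hash and keep the first n) and the SHA-256-based
hash are assumptions made for the example, and F is assumed to be a multiple of k.

```python
import hashlib

def _h(word: str, salt: str) -> int:
    """Illustrative hash: SHA-256 of salt and word, truncated to 32 bits."""
    digest = hashlib.sha256(f"{salt}:{word}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def gfssf_word_signature(word: str, F: int, k: int, n: int, m: int) -> int:
    """Select n distinct frames out of k, and set m bits in each selected frame."""
    s = F // k                                            # frame size in bits
    frames = sorted(range(k), key=lambda f: _h(word, f"frame{f}"))[:n]
    sig = 0
    for f in frames:
        for i in range(m):
            sig |= 1 << (f * s + _h(word, f"bit{f}:{i}") % s)
    return sig

popcount = lambda x: bin(x).count("1")

# k = F (one-bit frames), n frames chosen: exactly n distinct bits, as in BSSF.
print(popcount(gfssf_word_signature("free", F=12, k=12, n=4, m=1)))   # 4
# n = 1: all bits fall into a single frame, as in FSSF.
print(format(gfssf_word_signature("free", F=12, k=2, n=1, m=3), "012b"))
# k = 1, n = 1: one frame spanning the whole signature, as in SSF.
print(format(gfssf_word_signature("free", F=12, k=1, n=1, m=4), "012b"))
```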

Performance

Since GFSSF is a generalized model, we expect that a careful choice of the parameters will give
a method that is better (whatever the criterion is) than any of its special cases. The published
analysis of GFSSF gives formulas for the false drop probability and the expected response time
for GFSSF and the rest of the methods. The figure below plots the theoretically expected
performance of GFSSF, BSSF, B'SSF, and FSSF. Notice that GFSSF is faster than BSSF, B'SSF, and
FSSF, which are all its special cases. It is assumed that the transfer time for a page is
Ttrans = 1 msec and that the combined seek and latency time is Tseek = 40 msec.

Response time vs. space overhead: a comparison between BSSF, B'SSF, FSSF and GFSSF.
Analytical results on a 2.8 Mb database

HORIZONTAL PARTITIONING

The motivation behind all these methods is to avoid the sequential scanning of the signature file
(or its bit-slices), in order to achieve better than O(N) search time. Thus, they group the
signatures into sets, partitioning the signature matrix horizontally. The grouping criterion can be
decided beforehand, in the form of a hashing function h(S), where S is a document signature
(the data independent case). Alternatively, the groups can be determined on the fly, using a
hierarchical structure such as a B-tree (the data dependent case).

Data Independent Case

Gustafson’s method:

The earliest approach was proposed by Gustafson (1971). Suppose that we have records with,
say, six attributes each. For example, records can be documents and attributes can be keywords
describing the document. Consider a hashing function h that hashes a keyword w to a number
h(w) in the range 0-15. The signature of a keyword is a string of 16 bits, all of which are zero
except for the bit at position h(w). The record signature is created by superimposing the
corresponding keyword signatures. If k < 6 bits are set in a record signature, an additional
6 - k bits are set by some random method.
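
A minimal sketch of this signature construction follows. The keyword hash (SHA-256 modulo 16)
and the padding rule (a generator seeded with the partial signature, so that padding is
repeatable) are assumptions of the sketch; the notes only say "some random method".

```python
import hashlib
import random
from math import comb

def keyword_position(w: str, size: int = 16) -> int:
    """h(w): hash a keyword to a position in 0..size-1 (SHA-256 is an assumption)."""
    digest = hashlib.sha256(w.encode()).digest()
    return int.from_bytes(digest[:4], "big") % size

def record_signature(keywords, size: int = 16, bits: int = 6) -> int:
    """Superimpose one-bit keyword signatures, then pad to exactly `bits` ones."""
    sig = 0
    for w in keywords:
        sig |= 1 << keyword_position(w, size)
    missing = max(0, bits - bin(sig).count("1"))
    unset = [p for p in range(size) if not (sig >> p) & 1]
    random.Random(sig).shuffle(unset)        # deterministic padding (assumption)
    for p in unset[:missing]:
        sig |= 1 << p
    return sig

sig = record_signature(["signature", "file", "hashing"])
print(format(sig, "016b"), bin(sig).count("1"))   # always 6 ones
print(comb(16, 6))                                # 8008 possible record signatures
```

Since every record signature has exactly 6 of its 16 bits set, there are only C(16, 6) = 8008
distinct signatures, which is what makes it possible to use the signature as a hash address.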

Although elegant, Gustafson’s method suffers from some practical problems:

• Its performance deteriorates as the file grows.
• If the number of keywords per document is large, then either we must have a huge hash
table or usual queries (involving 3-4 keywords) will touch a large portion of the database.
• Queries other than conjunctive ones are handled with difficulty.

Partitioned signature files

This approach uses a portion of a document signature as a signature key to partition the
signature file. For example, we can choose the first 20 bits of a signature as its key; all
signatures with the same key are then grouped into a so-called "module."
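
The grouping can be sketched as follows, with a 4-bit key on 12-bit signatures to keep the
example small. The function names are assumptions, and so is the pruning rule in search, which
simply follows from the signature test: a module whose key lacks a 1 that the query has in its
key portion cannot contain a qualifying signature.

```python
from collections import defaultdict

def signature_key(sig: int, F: int, key_bits: int) -> int:
    """The first (most significant) key_bits bits of an F-bit signature."""
    return sig >> (F - key_bits)

def build_modules(signatures, F: int, key_bits: int):
    """Group (doc_id, signature) pairs into modules that share the same key."""
    modules = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        modules[signature_key(sig, F, key_bits)].append((doc_id, sig))
    return modules

def search(modules, query_sig: int, F: int, key_bits: int) -> list[int]:
    """Scan only the modules whose key covers the key portion of the query."""
    qkey = signature_key(query_sig, F, key_bits)
    hits = []
    for key, members in modules.items():
        if (key & qkey) != qkey:
            continue                      # whole module skipped without scanning
        hits.extend(d for d, sig in members if (sig & query_sig) == query_sig)
    return hits

sigs = [0b110000001111, 0b110011110000, 0b001100001111]
modules = build_modules(sigs, F=12, key_bits=4)
print(search(modules, 0b100000000011, F=12, key_bits=4))   # [0]
```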

Data Dependent Case

Two-level signature files

Sacks-Davis and his colleagues (1983, 1987) suggested using two levels of signatures. Their
documents are bibliographic records of variable length. The first level of signatures consists of
document signatures that are stored sequentially, as in the SSF method. The second level consists
of "block signatures"; each such signature corresponds to one block (group) of bibliographic
records, and is created by superimposing the signatures of all the words in this block, ignoring
the record boundaries. The second level is stored in a bit-sliced form. Each level has its own
hashing functions that map words to bit positions.

S-tree

Deppisch (1986) proposed a B-tree-like structure to facilitate fast access to the records (which
are signatures) in a signature file. A leaf of an S-tree consists of k "similar" (i.e., with small
Hamming distance) document signatures along with the document identifiers. The OR-ing of
these k document signatures forms the "key" of an entry in an upper level node, which serves as
a directory for the leaves. Recursively, we construct directories on lower level directories until
we reach the root. The S-tree is kept balanced in a similar manner to a B-tree: when a leaf node
overflows, it is split into two groups of "similar" signatures; the parent node is changed
appropriately to reflect the new situation. Splits may propagate upward until reaching the root.

The method requires a small space overhead; the response time on queries is difficult to
estimate analytically. An insertion requires a few disk accesses (proportional to the height of
the tree at worst), but the append-only property is lost. Another problem is that higher level
nodes may contain keys that have many 1's and thus become useless.
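
The search side of such a tree can be sketched briefly. The sketch builds a fixed two-level tree
by hand and leaves out insertion, splitting and the choice of "similar" signatures; the Node
class and the 4-bit signatures are assumptions made for the example.

```python
from functools import reduce
from operator import or_

class Node:
    """A leaf holds (doc_id, signature) entries; a directory holds child nodes.
    The key of a node is the OR of everything below it."""

    def __init__(self, children=None, entries=None):
        self.children = children or []
        self.entries = entries or []
        sigs = [c.key for c in self.children] + [s for _, s in self.entries]
        self.key = reduce(or_, sigs, 0)

def search(node: Node, q: int) -> list[int]:
    """Descend only into subtrees whose key covers every 1 of the query."""
    if (node.key & q) != q:
        return []
    hits = [d for d, s in node.entries if (s & q) == q]
    for child in node.children:
        hits.extend(search(child, q))
    return hits

leaf1 = Node(entries=[(0, 0b1100), (1, 0b1110)])   # key 1110
leaf2 = Node(entries=[(2, 0b0011), (3, 0b0111)])   # key 0111
root = Node(children=[leaf1, leaf2])               # key 1111

print(search(root, 0b0110))   # [1, 3]
print(search(root, 0b1001))   # []  both leaves pruned, although the root key is all 1's
```

The root key 1111 matches every query and therefore prunes nothing, which is a small instance of
the problem with keys that have many 1's.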
