Bloom Filters: Differential Files
Simple Large Database
Operations.
Retrieve.
Update: insert a new record, make changes to an existing record, or delete a record.
Problems.
Index and File change with time. Sooner or later, the system will crash.
Recovery: copy Master File (MF) from backup, copy Master Index (MI) from backup, then process all transactions since last backup.
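The recovery steps can be sketched in a few lines of Python. This is only an illustration under assumed representations: the master file is a list of records, the master index is a dict from key to list position, and the transaction log is a list of hypothetical (op, key, value) tuples. None of these names come from the slides.

```python
def recover(master_backup, index_backup, transactions):
    """Rebuild the current state from backup copies plus the transaction log."""
    master = list(master_backup)   # copy MF from backup
    index = dict(index_backup)     # copy MI from backup (key -> position in MF)
    # Process all transactions since the last backup, in their original order.
    for op, key, value in transactions:
        if op == "insert":
            index[key] = len(master)
            master.append(value)
        elif op == "change":
            master[index[key]] = value
        elif op == "delete":
            del index[key]         # the record's slot in MF is simply left unused
    return master, index


master, index = recover(
    ["a", "b"],
    {1: 0, 2: 1},
    [("insert", 3, "c"), ("change", 1, "A"), ("delete", 2, None)],
)
```

The longer the log since the last backup, the longer this loop runs, which is why recovery time depends on how often backups can be taken.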
[Figure: Index and File.]
Differential File
Make no changes to master file. Alter index and write updated record to a new file called the differential file (DF).
Advantage.
DF is smaller than File and so may be backed up more frequently. Index needs to be backed up whenever DF is. So, index should be small as well. Recovery time is reduced.
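A minimal sketch of the differential-file idea, assuming records are (key, value) pairs held in Python lists and the index is a dict mapping each key to a (file, position) pair. The class and method names are hypothetical; deletion is omitted here.

```python
class DifferentialStore:
    """Master file is read-only; every update is appended to the DF."""

    def __init__(self, master_records):
        self.master = list(master_records)   # (key, value) pairs, never changed
        self.df = []                         # differential file, starts empty
        self.index = {key: ("MF", pos)
                      for pos, (key, _) in enumerate(self.master)}

    def update(self, key, value):
        # Insert or change: write the record to the DF and redirect the index.
        self.index[key] = ("DF", len(self.df))
        self.df.append((key, value))

    def retrieve(self, key):
        where, pos = self.index[key]
        source = self.master if where == "MF" else self.df
        return source[pos][1]


store = DifferentialStore([(1, "a"), (2, "b")])
store.update(2, "B")   # change an existing record
store.update(3, "c")   # insert a new record
```

Only `df` and `index` change after construction, so those are the only structures that need frequent backup.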
[Figure: Index, File, DF, and answer (Ans.).]
Disadvantage.
Eventually DF becomes large and can no longer be backed up with desired frequency. Must integrate File and DF now. Following integration, DF is empty.
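Integration can be sketched as follows, again assuming (key, value) records in lists and an index that maps each key to a (file, position) pair. The function name and data layout are illustrative assumptions, not taken from the slides.

```python
def integrate(master, df, index):
    """Merge DF into a new master file and return (new_master, empty_df, new_index)."""
    new_master = []
    new_index = {}
    for key in sorted(index):
        where, pos = index[key]
        source = master if where == "MF" else df
        new_index[key] = ("MF", len(new_master))   # every record now lives in MF
        new_master.append(source[pos])
    return new_master, [], new_index


old_master = [(1, "a"), (2, "b")]
old_df = [(2, "B"), (3, "c")]
old_index = {1: ("MF", 0), 2: ("DF", 0), 3: ("DF", 1)}
new_master, new_df, new_index = integrate(old_master, old_df, old_index)
```

Superseded master records (here the old value for key 2) are dropped because the index no longer points at them.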
[Figure: Index, File, DF, and answer (Ans.).]
Large Index.
Index cannot be backed up as frequently as desired. Time to recover current state of index and DF is excessive. Use a differential index (DI).
Make no changes to Index. DI is an index to all deleted records and records in DF.
[Figure: Index, File, DF, and answer (Ans.).]
Performance hit.
Most queries search both DI and Index. Increase in # of disk accesses/query.
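The extra cost can be made concrete with a small sketch that counts how many index structures a query searches. It assumes the DI is a dict mapping a key to its DF position, with `None` marking a deleted record, and that the master index maps a key to its position in the file. These representations and the function name are assumptions for illustration.

```python
def lookup(key, di, index, master, df):
    """Return (value, probes), where probes counts index structures searched."""
    probes = 1                       # the DI is always searched first
    if key in di:
        pos = di[key]
        value = None if pos is None else df[pos]   # None marks a deleted record
        return value, probes
    probes += 1                      # DI miss: the master index is searched as well
    pos = index.get(key)
    value = None if pos is None else master[pos]
    return value, probes


master = ["a", "b", "c"]
index = {1: 0, 2: 1, 3: 2}
df = ["B"]
di = {2: 0, 3: None}                 # key 2 was changed, key 3 was deleted

unchanged = lookup(1, di, index, master, df)   # searches DI, then Index
changed = lookup(2, di, index, master, df)     # answered from DI and DF alone
```

Keys that were never updated are the common case, and each of them pays for two index searches instead of one.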
[Figure: DI with N and Y outputs, Index, File, DF, and answer (Ans.).]
Ideal Filter
[Figure: Key, Filter with N and Y outputs, Index, DI, File, DF, and answer (Ans.).]
Y => this key is in the DI. N => this key is not in the DI. Functionality of ideal filter is same as that of DI. So, a filter that eliminates the performance hit of the DI doesn't exist.
[Figure: Bloom filter (BF) with N output, and DI.]
N => this key is not in the DI. M (maybe) => this key may be in the DI.
Filter error: BF says Maybe, but DI says No.
[Figure: filter error. BF, DI, File, DF, and answer (Ans.).]
BF resides in memory. Performance hit paid only when there is a filter error.
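The query path with an in-memory filter might look like the sketch below. The filter is passed in as a hypothetical callable `maybe_in_di` that returns False for No and True for Maybe; the DI maps keys to DF positions (`None` for deleted), and the master index maps keys to file positions. All names are assumptions.

```python
def lookup_with_filter(key, maybe_in_di, di, index, master, df):
    """Return (value, di_searched). The DI is searched only on a Maybe."""
    if maybe_in_di(key):                       # BF says Maybe
        if key in di:                          # DI confirms the key
            pos = di[key]
            return (None if pos is None else df[pos]), True
        # Filter error: BF said Maybe, DI said No. The DI search was wasted.
        pos = index.get(key)
        return (None if pos is None else master[pos]), True
    pos = index.get(key)                       # BF says No: skip the DI entirely
    return (None if pos is None else master[pos]), False


master = ["a", "b", "c"]
index = {1: 0, 2: 1, 3: 2}
df = ["B"]
di = {2: 0}                                    # only key 2 has been updated


def fake_filter(key):
    # Stand-in for a Bloom filter: key 1 is a deliberate false Maybe.
    return key in {1, 2}


error_case = lookup_with_filter(1, fake_filter, di, index, master, df)
hit_case = lookup_with_filter(2, fake_filter, di, index, master, df)
skip_case = lookup_with_filter(3, fake_filter, di, index, master, df)
```

Only `error_case` pays the DI search without benefiting from it; `skip_case` never touches the DI.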
[Figure: B1, B2, B3, through BW on chip; H1, H2, H3, through HW off chip.]
Example
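[Figure: bit array with positions 0 through 10.]
m = 11 (normally, m would be much larger). h = 2 (2 hash functions). f1(k) = k mod m. f2(k) = (2k) mod m.
k = 15: f1(15) = 4 and f2(15) = 8, so bits 4 and 8 are set to 1.
k = 17: f1(17) = 6 and f2(17) = 1, so bits 6 and 1 are set to 1.
After both insertions, bits 1, 4, 6, and 8 are 1 and all other bits are 0.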
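The example can be reproduced with a short Python sketch using the same m, f1, and f2. The helper names `insert` and `maybe_contains` are illustrative choices, not from the slides.

```python
M = 11                      # filter size m from the example


def f1(k):
    return k % M


def f2(k):
    return (2 * k) % M


def insert(bits, k):
    # Set the bit selected by each of the h = 2 hash functions.
    bits[f1(k)] = 1
    bits[f2(k)] = 1


def maybe_contains(bits, k):
    # False means "definitely not inserted"; True means "maybe inserted".
    return bits[f1(k)] == 1 and bits[f2(k)] == 1


bits = [0] * M
insert(bits, 15)            # sets bits 4 and 8
insert(bits, 17)            # sets bits 6 and 1
```

With these bits, a query for k = 5 gets No because bit 5 is 0, while a query for k = 4, which was never inserted, gets Maybe because it maps to the same bits as 15. That second case is a filter error.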
Optimal Choice Of h
Probability of a filter error depends on:
Filter size, m.
Number of hash functions, h.
Number of updates before filter is reset to 0, u. An update is an insert, delete, or change.
Optimal h
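$p(u) \approx e^{-u/n}\,(1 - e^{-uh/m})^{h}$
[Figure: plot of p(u) versus h.]
Only the second factor depends on h, and it is minimized at $h = (m/u)\ln 2$.
For $m = 2 \times 10^{6}$ and $u = 10^{6}/2$, this gives $h \approx 2.772$. Use h = 2 or h = 3.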
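A quick numeric check of the optimal h, using the error estimate p(u) ~ e^(-u/n) (1 - e^(-uh/m))^h. The value of n is not given in this section; the sketch takes n to be the number of records in the file and picks 10^7 purely as a placeholder. Since n only scales p(u), it does not affect which h is best.

```python
import math


def filter_error(u, n, m, h):
    """Approximate probability of a filter error after u updates."""
    return math.exp(-u / n) * (1 - math.exp(-u * h / m)) ** h


m = 2 * 10**6               # filter size from the slide
u = 10**6 // 2              # updates before the filter is reset
n = 10**7                   # assumed file size, placeholder only

h_opt = (m / u) * math.log(2)          # real-valued minimizer, about 2.77

# Compare small integer choices of h, since h must be a whole number.
errors = {h: filter_error(u, n, m, h) for h in range(1, 6)}
best_h = min(errors, key=errors.get)
```

Under these numbers the integer candidates nearest `h_opt` are 2 and 3, and `errors` shows how close their error probabilities are.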