BDA Experiment 7
THEORY:
Suppose we have a window of length N on a binary stream. We want at all times to be able to
answer queries of the form “how many 1’s are there in the last k bits?” for any k≤ N. For this
purpose we use the DGIM algorithm.
The basic version of the algorithm uses O(log^2 N) bits to represent a window of N bits, and
allows us to estimate the number of 1’s in the window with an error of no more than 50%. To
begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has
timestamp 1, the second has timestamp 2, and so on.
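As a rough illustration of how such an estimate can be maintained, the sketch below follows the usual textbook description of DGIM: the 1's in the window are grouped into buckets whose sizes are powers of two, at most two buckets of each size are kept, and the oldest bucket contributes only half of its size to the estimate. The class name DGIM, its method names, and the use of absolute timestamps (rather than timestamps modulo N) are assumptions made for readability; this is not the program listed later in this report.

```python
class DGIM:
    """Approximate count of 1's among the last `window_size` bits of a stream."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.timestamp = 0  # absolute position of the most recent bit
        # Buckets stored newest first as (end_timestamp, size); sizes are powers of 2.
        self.buckets = []

    def add(self, bit):
        self.timestamp += 1
        # The oldest bucket expires once its newest 1 has left the window.
        if self.buckets and self.buckets[-1][0] <= self.timestamp - self.window_size:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.timestamp, 1))
        # Whenever three buckets share a size, merge the two oldest of them.
        i = 0
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            end, size = self.buckets[i + 1]
            self.buckets[i + 1] = (end, 2 * size)
            del self.buckets[i + 2]
            i += 1

    def count(self, k=None):
        """Estimate the number of 1's among the last k bits (k <= window_size)."""
        k = self.window_size if k is None else k
        cutoff = self.timestamp - k
        sizes = [size for end, size in self.buckets if end > cutoff]
        if not sizes:
            return 0.0
        # All newer buckets count in full; the oldest one counts as half.
        return sum(sizes[:-1]) + sizes[-1] / 2


stream = DGIM(window_size=1000)
for _ in range(8):
    stream.add(1)
# Buckets now have sizes 1, 1, 2, 4, so the estimate is 1 + 1 + 2 + 4/2 = 6.0
estimate = stream.count()
```

In this small run the true count is 8 and the estimate is 6.0, which is within the 50% error bound stated above. Counting the oldest bucket as half is the source of the error, since only part of that bucket may lie inside the queried range.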
Since we only need to distinguish positions within the window of length N, we shall represent
timestamps modulo N, so they can be represented by log_2 N bits. If we also store the total
number of bits ever seen in the stream (i.e., the most recent timestamp) modulo N, then we can
determine from a timestamp modulo N where in the current window the bit with that timestamp
is.
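The modular bookkeeping described above can be sketched in a few lines. The function name age_in_window and its parameter names are hypothetical, chosen only for this illustration; the sketch assumes the stored bit is still inside the window, since otherwise the result is ambiguous.

```python
def age_in_window(stored_ts_mod, current_ts_mod, n):
    """Return how many positions behind the newest bit a stored bit lies.

    Both timestamps are already reduced modulo n (the window length).
    A result of 0 means the stored bit is the most recent one.
    Only meaningful while the stored bit is still inside the window.
    """
    return (current_ts_mod - stored_ts_mod) % n


# Window of length 8: a bit that arrived at timestamp 13, queried at timestamp 18.
n = 8
age = age_in_window(13 % n, 18 % n, n)  # 5, the same as 18 - 13
```

Python's % operator returns a non-negative result for a positive modulus, so the subtraction works even when the current timestamp has wrapped around past the stored one.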
A Bloom filter is a space-efficient probabilistic data structure used to test whether an
element is a member of a set. For example, checking whether a username is available is a set
membership problem, where the set is the list of all registered usernames. The price we pay for
this efficiency is that the structure is probabilistic, so a query may return a false positive:
the filter may report that a username is already taken when it is not.
We need k hash functions to compute the hashes of a given input. To add an item x to the
filter, we set the bits at the k indices h1(x), h2(x), …, hk(x), where each index is computed
by one of the hash functions.
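The add and lookup steps can be sketched as follows. This is a minimal illustration, not the program from this experiment: the class name SimpleBloomFilter and its methods are hypothetical, and the k hash functions are simulated by salting SHA-256 from the standard library with the function index, instead of the murmur3 hash mentioned in the listing below.

```python
import hashlib


class SimpleBloomFilter:
    """Minimal Bloom filter over a list of booleans (illustrative only)."""

    def __init__(self, size, hash_count):
        self.size = size              # number of bits in the filter
        self.hash_count = hash_count  # k, the number of hash functions
        self.bits = [False] * size

    def _indices(self, item):
        # Derive k indices by hashing the item together with the index i.
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = True

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[idx] for idx in self._indices(item))


usernames = SimpleBloomFilter(size=1000, hash_count=3)
for name in ["bloom", "blossom", "bonus"]:
    usernames.add(name)
taken = usernames.might_contain("bloom")  # True, since "bloom" was added
```

A lookup returns False as soon as any of the k bits is unset, so an item that was added is never reported as absent. A True answer only means that all k bits happen to be set, possibly by other items.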
DGIM PROGRAM:
container = {}
windowsize = 1000
timestamp = 0
updateinterval = 1000  # no larger than the windowsize
updateindex = 0
class BloomFilter(object):
    '''
    Class for Bloom filter, using murmur3 hash function
    '''

    def __init__(self, items_count, fp_prob):
        '''
        items_count : int
            Number of items expected to be stored in bloom filter
        fp_prob : float
            False Positive probability in decimal
        '''
        # False positive probability in decimal
        self.fp_prob = fp_prob
from random import shuffle

# words to be added
word_present = ['abound', 'abounds', 'abundance', 'abundant', 'accessable',
                'bloom', 'blossom', 'bolster', 'bonny', 'bonus', 'bonuses',
                'coherent', 'cohesive', 'colorful', 'comely', 'comfort',
                'gems', 'generosity', 'generous', 'generously', 'genial']

# words not added
word_absent = ['bluff', 'cheater', 'hate', 'war', 'humanity',
               'racism', 'hurt', 'nuke', 'gloomy', 'facebook',
               'geeksforgeeks', 'twitter']

# randomize the order in which the words are processed
shuffle(word_present)
shuffle(word_absent)
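The listing documents the parameters items_count and fp_prob but does not show how they determine the dimensions of the filter. A common choice, assumed here and not taken from the listing, is the standard pair of formulas m = -n ln(p) / (ln 2)^2 for the number of bits and k = (m / n) ln 2 for the number of hash functions. The helper names bloom_size and bloom_hash_count are hypothetical.

```python
import math


def bloom_size(items_count, fp_prob):
    """Number of bits m = -n * ln(p) / (ln 2)^2, rounded up."""
    m = -(items_count * math.log(fp_prob)) / (math.log(2) ** 2)
    return math.ceil(m)


def bloom_hash_count(size, items_count):
    """Number of hash functions k = (m / n) * ln 2, at least 1."""
    k = (size / items_count) * math.log(2)
    return max(1, round(k))


# Sizing a filter for 20 words with a 5% false positive target.
m = bloom_size(20, 0.05)
k = bloom_hash_count(m, 20)
```

Lowering fp_prob increases the number of bits required, while the number of hash functions depends only on the ratio of bits to expected items.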
OUTPUT:
Conclusion:
We have successfully implemented the DGIM algorithm and a Bloom filter.