BDA Experiment 7

Uploaded by

pabocon672
Copyright
© © All Rights Reserved

EXPERIMENT NO:07

AIM: Implementing the DGIM algorithm using any programming language / implementing a
Bloom filter using any programming language.

THEORY:

What is DGIM Algorithm?

Suppose we have a window of length N on a binary stream. We want at all times to be able to
answer queries of the form “how many 1’s are there in the last k bits?” for any k≤ N. For this
purpose we use the DGIM algorithm.
The basic version of the algorithm uses O(log^2 N) bits to represent a window of N bits, and
allows us to estimate the number of 1’s in the window with an error of no more than 50%. To
begin, each bit of the stream has a timestamp, the position in which it arrives. The first bit has
timestamp 1, the second has timestamp 2, and so on.
Since we only need to distinguish positions within the window of length N, we shall represent
timestamps modulo N, so they can be represented by log2 N bits. If we also store the total
number of bits ever seen in the stream (i.e., the most recent timestamp) modulo N, then we can
determine from a timestamp modulo N where in the current window the bit with that timestamp
is.
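
The modular timestamp arithmetic described above can be sketched in a few lines of Python. This is only an illustration; the function name age_in_window and the sample numbers are assumptions, not part of the experiment's program.

```python
def age_in_window(stored_ts, current_ts, window_size):
    """How many positions back a stamped bit lies (0 = the newest bit).

    Both timestamps are already reduced modulo window_size, so the result
    is only meaningful for bits that are still inside the window.
    """
    return (current_ts - stored_ts) % window_size


N = 8
current = 13 % N   # 13 bits seen so far, stored as 5
stored = 10 % N    # a bit that arrived at position 10, stored as 2
print(age_in_window(stored, current, N))  # 3, i.e. 13 - 10
```

The modulo in the subtraction handles wrap-around: a bit stamped 7 seen when the current stamp is 1 is 2 positions old.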

We divide the window into buckets, each consisting of:


1. The timestamp of its right (most recent) end.
2. The number of 1’s in the bucket. This number must be a power of 2, and we refer to the
number of 1’s as the size of the bucket.
To represent a bucket, we need log2 N bits to represent the timestamp (modulo N) of its right
end. To represent the number of 1’s we only need log2 log2 N bits. The reason is that we know
this number i is a power of 2, say 2^j, so we can represent i by coding j in binary. Since j is at
most log2 N, it requires log2 log2 N bits. Thus, O(log N) bits suffice to represent a bucket.
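
As a rough check of the bit counts above, the following sketch computes the storage for one bucket. The function name bucket_bits and the choice N = 1024 are assumptions for illustration, and the bucket total uses the upper bound of two buckets per size.

```python
def bucket_bits(window_size):
    """Return (timestamp bits, size-exponent bits) for one bucket."""
    # Timestamps are stored modulo N, so they take the values 0 .. N-1.
    timestamp_bits = (window_size - 1).bit_length()
    # A bucket of size 2**j is stored as j, with 0 <= j <= floor(log2 N).
    max_exponent = window_size.bit_length() - 1
    exponent_bits = max(1, max_exponent.bit_length())
    return timestamp_bits, exponent_bits


ts_bits, exp_bits = bucket_bits(1024)
print(ts_bits, exp_bits)  # 10 4

# Upper bound: two buckets for each of the 11 sizes 1, 2, 4, ..., 1024.
max_buckets = 2 * (10 + 1)
print(max_buckets * (ts_bits + exp_bits))  # 308 bits, versus 1024 for the raw window
```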
There are six rules that must be followed when representing a stream by buckets.
1. The right end of a bucket is always a position with a 1.
2. Every position with a 1 is in some bucket.
3. No position is in more than one bucket.
4. There are one or two buckets of any given size, up to some maximum size.
5. All sizes must be a power of 2.
6. Buckets cannot decrease in size as we move to the left (back in time).
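
The rules above can be turned into a compact sketch. It is separate from the program in this experiment and makes some simplifying assumptions: the class name DGIM and its methods add and count are hypothetical, timestamps are kept as absolute positions rather than modulo N, and an oldest bucket of size 1 is counted in full because its right end is known to be a 1.

```python
class DGIM:
    """Minimal DGIM sketch using absolute timestamps (an assumption)."""

    def __init__(self, window_size):
        self.N = window_size
        self.t = 0
        # Buckets are (right_end_timestamp, size) pairs, newest first.
        self.buckets = []

    def add(self, bit):
        self.t += 1
        # Drop the oldest bucket once its right end has left the window.
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.t, 1))
        # Rule 4: if three buckets share a size, merge the two oldest of
        # them into one bucket of twice the size, then check the next size.
        i = 0
        while (i + 2 < len(self.buckets)
               and self.buckets[i][1] == self.buckets[i + 2][1]):
            newer_ts, size = self.buckets[i + 1]
            del self.buckets[i + 2]
            self.buckets[i + 1] = (newer_ts, 2 * size)
            i += 1

    def count(self, k=None):
        """Estimate the number of 1's among the last k bits (k <= N)."""
        k = self.N if k is None else k
        cutoff = self.t - k
        total = 0
        oldest_size = 0
        for ts, size in self.buckets:
            if ts <= cutoff:
                break
            total += size
            oldest_size = size
        # Count only half of the oldest overlapping bucket.
        return total - oldest_size // 2


d = DGIM(8)
for b in [1, 0, 1, 1, 1]:
    d.add(b)
print(d.count())  # 3 (the true count is 4, within the 50% bound)
```

Storing the buckets newest first keeps the sizes non-decreasing along the list, so a cascade of merges only ever moves toward the end of the list.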
What is Bloom Filter?

A Bloom filter is a space-efficient probabilistic data structure that is used to test whether an
element is a member of a set. For example, checking the availability of a username is a set
membership problem, where the set is the list of all registered usernames. The price we pay for
efficiency is that the structure is probabilistic, which means there may be some false positive
results. A false positive means the filter reports that a given username is already taken when it
actually is not. False negatives never occur: if the filter says an element is absent, it is absent.

Working of Bloom Filter


An empty Bloom filter is a bit array of m bits, all set to zero.

We need k hash functions to calculate the hashes for a given input. When we want to add an
item to the filter, the bits at the k indices h1(x), h2(x), …, hk(x) are set, where the indices
are calculated using the hash functions.
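
The add and lookup steps described above can be sketched with the standard library alone. This is not the program used in this experiment: the class name TinyBloom and its methods are hypothetical, and the k hash functions are simulated by salting SHA-256 with the index i, which is an assumption made to keep the example dependency-free.

```python
import hashlib


class TinyBloom:
    """Minimal Bloom filter sketch: m bits, k salted hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m  # an empty filter is m zero bits

    def _indices(self, item):
        # Assumption: h_i(x) is SHA-256 of "i:x", reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Set the bits at h1(x), ..., hk(x).
        for idx in self._indices(item):
            self.bits[idx] = 1

    def might_contain(self, item):
        # A single zero bit proves the item was never added.
        return all(self.bits[idx] for idx in self._indices(item))


bf = TinyBloom(m=64, k=3)
bf.add("alice")
print(bf.might_contain("alice"))  # True: an added item is always found
```

A lookup for an item that was never added normally returns False, but it can return True when other items happen to have set all k of its bits; that is the false positive case.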

DGIM PROGRAM:

import math

filename = "test.txt"

# Maps each bucket size (1, 2, 4, ...) to the timestamps of its buckets, oldest first
container = {}
windowsize = 1000
timestamp = 0
updateinterval = 1000  # no larger than the windowsize
updateindex = 0

keysnum = int(math.log(windowsize, 2)) + 1
keylist = list()

# initialize the container with one empty list per bucket size
for i in range(keysnum):
    key = int(math.pow(2, i))
    keylist.append(key)
    container[key] = list()


def UpdateContainer(inputdict, klist, numkeys):
    # Whenever a size has three buckets, merge the two oldest into one
    # bucket of twice the size, keeping the more recent timestamp.
    for key in klist:
        if len(inputdict[key]) > 2:
            inputdict[key].pop(0)
            tstamp = inputdict[key].pop(0)
            if key != klist[-1]:
                inputdict[key * 2].append(tstamp)
        else:
            break


def OutputResult(inputdict, klist, wsize):
    cnt = 0
    firststamp = 0
    # After this loop, firststamp is the timestamp of the oldest bucket
    for key in klist:
        if len(inputdict[key]) > 0:
            firststamp = inputdict[key][0]
        for tstamp in inputdict[key]:
            print("size of bucket: %d, timestamp: %d" % (key, tstamp))
    # Count every bucket in full except the oldest, which counts as half
    for key in klist:
        for tstamp in inputdict[key]:
            if tstamp != firststamp:
                cnt += key
            else:
                cnt += 0.5 * key
    print("Estimated number of ones in the last %d bits: %d" % (wsize, cnt))


with open(filename, 'r') as sfile:
    while True:
        char = sfile.read(1)
        if not char:  # no more input
            OutputResult(container, keylist, windowsize)
            break
        if char not in "01":  # skip newlines and other non-bit characters
            continue
        timestamp = (timestamp + 1) % windowsize
        # remove the record which is out of the window
        for k in keylist:
            container[k] = [s for s in container[k] if s != timestamp]
        if char == "1":  # add it to the container
            container[1].append(timestamp)
            UpdateContainer(container, keylist, keysnum)
        updateindex = (updateindex + 1) % updateinterval
        if updateindex == 0:
            OutputResult(container, keylist, windowsize)
            print("\n")

OUTPUT:

BLOOM FILTER PROGRAM:

import math
import mmh3
from bitarray import bitarray


class BloomFilter(object):
    '''
    Class for Bloom filter, using murmur3 hash function
    '''

    def __init__(self, items_count, fp_prob):
        '''
        items_count : int
            Number of items expected to be stored in bloom filter
        fp_prob : float
            False Positive probability in decimal
        '''
        # False positive probability in decimal
        self.fp_prob = fp_prob

        # Size of bit array to use
        self.size = self.get_size(items_count, fp_prob)

        # number of hash functions to use
        self.hash_count = self.get_hash_count(self.size, items_count)

        # Bit array of given size
        self.bit_array = bitarray(self.size)

        # initialize all bits as 0
        self.bit_array.setall(0)

    def add(self, item):
        '''
        Add an item in the filter
        '''
        for i in range(self.hash_count):
            # create digest for given item.
            # i works as the seed to the mmh3.hash() function.
            # With a different seed, the digest created is different.
            digest = mmh3.hash(item, i) % self.size

            # set the bit True in bit_array
            self.bit_array[digest] = True

    def check(self, item):
        '''
        Check for existence of an item in filter
        '''
        for i in range(self.hash_count):
            digest = mmh3.hash(item, i) % self.size
            if not self.bit_array[digest]:
                # if any bit is False then the item is not present
                # in the filter;
                # else there is a probability that it exists
                return False
        return True

    @classmethod
    def get_size(cls, n, p):
        '''
        Return the size of bit array (m) to be used, using
        the following formula
        m = -(n * lg(p)) / (lg(2)^2)
        n : int
            number of items expected to be stored in filter
        p : float
            False Positive probability in decimal
        '''
        m = -(n * math.log(p)) / (math.log(2) ** 2)
        return int(m)

    @classmethod
    def get_hash_count(cls, m, n):
        '''
        Return the number of hash functions (k) to be used,
        using the following formula
        k = (m/n) * lg(2)
        m : int
            size of bit array
        n : int
            number of items expected to be stored in filter
        '''
        k = (m / n) * math.log(2)
        return int(k)
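
The two sizing formulas in the docstrings above can be checked by hand for the values used in this experiment (n = 20, p = 0.05). The helper names below are hypothetical, and expected_fp_rate uses the standard approximation (1 - e^(-kn/m))^k, which does not appear in the original program.

```python
import math


def bloom_size(n, p):
    """m = -(n * ln p) / (ln 2)^2, truncated to an integer."""
    return int(-(n * math.log(p)) / (math.log(2) ** 2))


def bloom_hash_count(m, n):
    """k = (m / n) * ln 2, truncated to an integer."""
    return int((m / n) * math.log(2))


def expected_fp_rate(m, n, k):
    """Standard approximation of the false positive rate (an assumption here)."""
    return (1 - math.exp(-k * n / m)) ** k


n, p = 20, 0.05
m = bloom_size(n, p)
k = bloom_hash_count(m, n)
print(m, k)                                  # 124 4
print(round(expected_fp_rate(m, n, k), 3))   # 0.051
```

Truncating m and k to integers moves the expected rate slightly away from the requested 0.05, which is why the approximation gives about 0.051.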

# Test driver (assumes the class above is saved as bloomfilter.py)
from bloomfilter import BloomFilter
from random import shuffle

n = 20  # no of items to add
p = 0.05  # false positive probability

bloomf = BloomFilter(n, p)
print("Size of bit array:%d" % bloomf.size)
print("False positive Probability:%f" % bloomf.fp_prob)
print("Number of hash functions:%d" % bloomf.hash_count)

# words to be added
word_present = ['abound', 'abounds', 'abundance', 'abundant', 'accessable',
                'bloom', 'blossom', 'bolster', 'bonny', 'bonus', 'bonuses',
                'coherent', 'cohesive', 'colorful', 'comely', 'comfort',
                'gems', 'generosity', 'generous', 'generously', 'genial']

# words not added
word_absent = ['bluff', 'cheater', 'hate', 'war', 'humanity',
               'racism', 'hurt', 'nuke', 'gloomy', 'facebook',
               'geeksforgeeks', 'twitter']

for item in word_present:
    bloomf.add(item)

shuffle(word_present)
shuffle(word_absent)

test_words = word_present[:10] + word_absent
shuffle(test_words)

for word in test_words:
    if bloomf.check(word):
        if word in word_absent:
            print("'%s' is a false positive!" % word)
        else:
            print("'%s' is probably present!" % word)
    else:
        print("'%s' is definitely not present!" % word)

OUTPUT:
Conclusion:
We have successfully implemented the DGIM algorithm and a Bloom filter.

Name: Huzaif Shaikh


Roll No. : 53
Date: 30th Sept,2024

Marks: Signature of Supervisor
