Bda 8 59
B13/59
Experiment No. 8
Aim:
This document aims to implement a Bloom Filter using the MapReduce programming model. The
Bloom Filter is a probabilistic data structure that efficiently tests whether an element is a member of
a set, with a small chance of false positives but no false negatives.
Theory:
A Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a
member of a set. It allows for fast membership testing with a trade-off: the possibility of false
positives (i.e., an element might be wrongly identified as present) but no false negatives (i.e., if an
element is present, it will always be correctly identified).
- Multiple hash functions: Each element is hashed using `k` different hash functions, and the
corresponding positions in a bit array are set to 1.
- Bit array: The Bloom Filter maintains a bit array of `m` bits, initially all set to 0. For each inserted
element, the `k` hash functions set `k` positions to 1.
- Query, all bits set: If every one of the `k` positions computed for the queried element holds a 1, the element is considered "possibly in the set."
- Query, any bit clear: If any of those positions holds a 0, the element is definitely not in the set.
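The insert and query rules above can be sketched as a small single-machine class. This is a minimal illustration, not the MapReduce implementation: the class name `BloomFilter`, the method names, the sizes `m=1000` and `k=3`, and the use of SHA-256 with an index prefix to simulate `k` hash functions are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Minimal in-memory Bloom filter (illustrative sketch)."""

    def __init__(self, m, k):
        self.m = m              # number of bits in the array
        self.k = k              # number of hash functions
        self.bits = [0] * m     # bit array, initially all 0

    def _positions(self, element):
        # Simulate k hash functions by prefixing the element with the index i.
        # SHA-256 is an assumed choice; any well-mixed hash would do.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{element}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, element):
        # Insertion: set the k positions for this element to 1.
        for pos in self._positions(element):
            self.bits[pos] = 1

    def might_contain(self, element):
        # True  -> "possibly in the set"; False -> "definitely not in the set"
        return all(self.bits[pos] == 1 for pos in self._positions(element))


bf = BloomFilter(m=1000, k=3)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)

# Inserted elements are always reported as present (no false negatives).
print(bf.might_contain("apple"))
```

A query for an element that was never inserted usually returns `False`, but it can return `True` when other elements happen to have set all of its positions. That case is the false positive described above.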
2. MapReduce Approach
In the MapReduce model, construction of the Bloom Filter can be distributed across multiple nodes so that large inputs are processed in parallel. Each mapper handles one split of the data, computes the hash values for its elements, and records which positions of the bit array must be set. Once the results from all mappers are combined, the Bloom Filter is ready for query operations.
- Map phase: Generates the `k` hash values for each element and records the corresponding bit positions.
- Reduce phase: Merges the output of all mappers (equivalent to a bitwise OR of per-mapper bit arrays) to form the final global bit array.
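The two phases can be simulated on one machine to see why merging works. The sketch below is an assumption-laden illustration rather than a Hadoop job: the names `positions`, `map_phase`, and `reduce_phase`, the sizes `M` and `K`, the input splits, and the salted SHA-256 hashing are all hypothetical. Each "mapper" builds a partial bit array and the "reducer" combines them with a bitwise OR.

```python
import hashlib
from functools import reduce

M = 64   # bit array size (hypothetical, kept small for readability)
K = 3    # number of hash functions (hypothetical)


def positions(element, k=K, m=M):
    """Return the k bit positions for an element (salted SHA-256, an assumed choice)."""
    return [
        int(hashlib.sha256(f"{i}:{element}".encode("utf-8")).hexdigest(), 16) % m
        for i in range(k)
    ]


def map_phase(split):
    """Build a partial bit array from one input split."""
    bits = [0] * M
    for element in split:
        for pos in positions(element):
            bits[pos] = 1
    return bits


def reduce_phase(partial_arrays):
    """Combine partial bit arrays with a bitwise OR."""
    return reduce(lambda a, b: [x | y for x, y in zip(a, b)], partial_arrays)


splits = [["apple", "banana"], ["cherry", "date"]]   # two hypothetical input splits
global_bits = reduce_phase([map_phase(s) for s in splits])

# The merged filter matches one built from all the data on a single node.
assert global_bits == map_phase(splits[0] + splits[1])
```

Because setting a bit is idempotent and OR is commutative and associative, the order in which mapper outputs arrive does not affect the final bit array.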
Code
1. Hash Functions
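We'll use multiple hash functions to determine the bit positions for each element. For simplicity, a single `hashlib` digest can be combined with variations, for example by mixing the index `i` into the input, to simulate `k` different hash functions. Python's built-in `hash` function is not suitable here, because string hashes are randomized per process and different mappers would compute different positions for the same element.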
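```python
#!/usr/bin/env python
# Mapper: emit one "<bit position>\t1" pair per hash of each input element
import sys
import hashlib

def get_hashes(element, num_hashes, bit_array_size):
    hashes = []
    for i in range(num_hashes):
        # Create different hash values using hashlib and mod them to fit the bit array
        # (md5 with the index as a prefix is an assumed choice)
        hash_value = int(hashlib.md5(f'{i}:{element}'.encode('utf-8')).hexdigest(), 16)
        hashes.append(hash_value % bit_array_size)
    return hashes

# Parameters
BIT_ARRAY_SIZE = 1000
NUM_HASHES = 3  # assumed value; the original does not state it

for line in sys.stdin:
    line = line.strip()
    if line:
        for pos in get_hashes(line, NUM_HASHES, BIT_ARRAY_SIZE):
            print(f'{pos}\t1')
```
The reducer reads the positions emitted by the mappers, sets the corresponding bits, and prints the final bit array as a string of 0s and 1s.
```python
# Reducer: set every emitted position in the global bit array
import sys

BIT_ARRAY_SIZE = 1000
bit_array = [0] * BIT_ARRAY_SIZE
for line in sys.stdin:
    line = line.strip()
    if line:
        position, _ = line.split('\t')
        bit_array[int(position)] = 1
print(''.join(map(str, bit_array)))
```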
Once the Bloom Filter is built, you can query it to check if an element might be in the set.
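```python
import sys
import hashlib

def get_hashes(element, num_hashes, bit_array_size):
    hashes = []
    for i in range(num_hashes):
        # Must match the hashing used when the filter was built (md5 with an index prefix is assumed)
        hash_value = int(hashlib.md5(f'{i}:{element}'.encode('utf-8')).hexdigest(), 16)
        hashes.append(hash_value % bit_array_size)
    return hashes

# Parameters
BIT_ARRAY_SIZE = 1000
NUM_HASHES = 3  # assumed value; must equal the value used during construction
bloom_filter = sys.stdin.read().strip()  # bit string produced by the reducer
element_to_check = sys.argv[1]

# Check if all the corresponding bits in the Bloom Filter are set to 1
is_present = all(bloom_filter[pos] == '1' for pos in get_hashes(element_to_check, NUM_HASHES, BIT_ARRAY_SIZE))
if is_present:
    print(f'{element_to_check} is possibly in the set')
else:
    print(f'{element_to_check} is definitely not in the set')
```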
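How often a "possibly in the set" answer is wrong depends on the bit array size `m`, the number of hash functions `k`, and the number of inserted elements `n`. A rough estimate can be sketched with the standard approximation p ≈ (1 - e^(-kn/m))^k, which assumes the hash functions behave as independent, uniform functions. The function names and the values of `n` below are hypothetical; `m = 1000` matches `BIT_ARRAY_SIZE` in the scripts above.

```python
import math


def false_positive_rate(m, k, n):
    """Approximate false-positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


def optimal_num_hashes(m, n):
    """Number of hash functions that minimizes the rate: (m/n) * ln 2, rounded."""
    return max(1, round((m / n) * math.log(2)))


m = 1000                        # BIT_ARRAY_SIZE used by the scripts
for n in (50, 100, 200):        # hypothetical numbers of inserted elements
    k = optimal_num_hashes(m, n)
    print(f"n={n:4d}  k={k:2d}  p~{false_positive_rate(m, k, n):.4f}")
```

The estimate shows the trade-off directly: for a fixed `m`, inserting more elements raises the false-positive rate, so the bit array should be sized for the expected number of elements.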
Conclusion:
Implementing a Bloom Filter using the MapReduce framework allows for efficient parallel processing
and distributed storage of large datasets. The Bloom Filter is a valuable tool when it is important to
perform membership queries with low memory overhead, such as in web caches, databases, and
spam filters.