Bda 8 59

This document outlines the implementation of a Bloom Filter using the MapReduce programming model, emphasizing its efficiency in testing set membership with a low chance of false positives. It details the construction of the Bloom Filter using multiple hash functions and a bit array, along with the MapReduce approach for distributed processing. The document includes code examples for the mapper, reducer, and query implementations, highlighting the utility of Bloom Filters in applications like web caches and spam filters.

Uploaded by

pjib225

Meet Laheri

B13/59

Experiment No. 8
Aim:

This document aims to implement a Bloom Filter using the MapReduce programming model. The
Bloom Filter is a probabilistic data structure that efficiently tests whether an element is a member of
a set, with a small chance of false positives but no false negatives.

Theory:

1. Bloom Filter Overview

A Bloom Filter is a space-efficient probabilistic data structure used to test whether an element is a
member of a set. It allows for fast membership testing with a trade-off: the possibility of false
positives (i.e., an element might be wrongly identified as present) but no false negatives (i.e., if an
element is present, it will always be correctly identified).

The Bloom Filter is constructed using:

- Multiple hash functions: Each element is hashed using `k` different hash functions, and the
corresponding positions in a bit array are set to 1.

- Bit array: The Bloom Filter maintains a bit array of `m` bits, initially all set to 0. For each inserted
element, the `k` hash functions set `k` positions to 1.

When checking if an element exists in the filter:

- The element is hashed using the same `k` hash functions.

- If all corresponding bits in the bit array are 1, the element is considered "possibly in the set."

- If any of the bits are 0, the element is definitely not in the set.
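
The insertion and lookup rules above can be sketched in a few lines of Python. This is a minimal
in-memory illustration, not part of the MapReduce job: the class name `BloomFilter` and its method
names are hypothetical, and the `k` positions are derived from MD5 digests of the item with an index
appended, the same scheme the mapper in the Code section uses.

```python
import hashlib


class BloomFilter:
    """Illustrative in-memory Bloom filter (hypothetical helper class)."""

    def __init__(self, m, k):
        self.m = m              # number of bits in the array
        self.k = k              # number of hash functions
        self.bits = [0] * m     # all bits start at 0

    def _positions(self, item):
        # Simulate k hash functions by appending the index i to the item.
        return [
            int(hashlib.md5(f"{item}{i}".encode()).hexdigest(), 16) % self.m
            for i in range(self.k)
        ]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly in the set"; False means "definitely not".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1000, k=3)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)

print(bf.might_contain("apple"))   # inserted items always report True
```

A query for an item that was never added usually returns False, but it can return True when all of
its `k` positions happen to have been set by other items. That case is the false positive.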

2. MapReduce Approach

In the MapReduce model, construction of the Bloom Filter can be distributed across multiple nodes.
Each mapper handles part of the input, computes the `k` hash positions for every element, and emits
them as key-value pairs. The reducer then sets the bit at every emitted position. Once all positions
have been merged into a single bit array, the Bloom Filter is ready for query operations.

The implementation consists of:

- Map phase: Generates the hash values for each element and emits each resulting bit position.

- Reduce phase: Collects the positions emitted by all mappers and sets the corresponding bits to form
the final global bit array.
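
The data flow between the two phases can be tried out locally before running on a cluster. The sketch
below imitates map, shuffle, and reduce in a single Python process. The function names `map_phase`
and `reduce_phase` are hypothetical, the sample records are made up for illustration, and the
parameters (3 hash functions, 1000 bits) mirror the values used in the Code section.

```python
import hashlib

NUM_HASHES = 3
BIT_ARRAY_SIZE = 1000


def map_phase(records):
    """Yield one (position, 1) pair per hash function per record."""
    for record in records:
        for i in range(NUM_HASHES):
            digest = hashlib.md5(f"{record}{i}".encode()).hexdigest()
            yield int(digest, 16) % BIT_ARRAY_SIZE, 1


def reduce_phase(pairs):
    """Set the bit at every position that appears in the mapper output."""
    bit_array = [0] * BIT_ARRAY_SIZE
    for position, _ in pairs:
        bit_array[position] = 1
    return bit_array


# Two input splits, as if handled by two separate mappers (made-up data).
split_a = ["apple", "banana"]
split_b = ["cherry", "apple"]

# The framework would sort mapper output by key before the reduce step.
shuffled = sorted(list(map_phase(split_a)) + list(map_phase(split_b)))
bit_array = reduce_phase(shuffled)

print(sum(bit_array), "bits set out of", BIT_ARRAY_SIZE)
```

Setting a bit is idempotent, so duplicate elements and the order in which pairs arrive do not change
the resulting bit array.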

Code

1. Hash Functions

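We'll use multiple hash functions to determine the bit positions for each element. For simplicity,
MD5 from Python's `hashlib` module is applied to the element with an index `i` appended, which
simulates `k` different hash functions. Python's built-in `hash()` is not used because string hashes
are randomized per process by default, so different mappers and the query script would compute
different positions.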

2. Mapper and Reducer Implementation

- Mapper Code (bloom_mapper.py):


```python
#!/usr/bin/env python3
import sys
import hashlib


# Simulate multiple hash functions by hashing the item with an index appended
def get_hashes(item, num_hashes, bit_array_size):
    hashes = []
    for i in range(num_hashes):
        # Create different hash values using hashlib and mod them to fit the bit array
        hash_value = int(hashlib.md5(f'{item}{i}'.encode()).hexdigest(), 16)
        hashes.append(hash_value % bit_array_size)
    return hashes


# Parameters
NUM_HASHES = 3         # Number of hash functions
BIT_ARRAY_SIZE = 1000  # Size of bit array

# Input comes from standard input (stdin)
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # Skip blank lines so they are not inserted as elements

    # Generate the hash positions for the item
    hash_positions = get_hashes(line, NUM_HASHES, BIT_ARRAY_SIZE)

    # Output the hash positions for the line (element)
    for pos in hash_positions:
        print(f'{pos}\t1')
```

- Reducer Code (bloom_reducer.py):

```python
#!/usr/bin/env python3
import sys

# Initialize a bit array of 0s (size must match BIT_ARRAY_SIZE in the mapper)
BIT_ARRAY_SIZE = 1000
bit_array = [0] * BIT_ARRAY_SIZE

# Input comes from standard input (stdin)
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # Ignore blank lines

    position, _ = line.split('\t', 1)

    # Update the bit array at the given position
    bit_array[int(position)] = 1

# Output the final bit array as a string of '0' and '1' characters.
# The job must run with a single reducer so that one complete array is produced.
print(''.join(map(str, bit_array)))
```
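
The reducer prints one character per bit, so a 1000-bit filter takes about 1000 bytes as text. If a
more compact form is wanted, the bit string can be packed eight bits per byte. The sketch below is an
optional addition, not part of the original job, and `pack_bits` and `unpack_bits` are hypothetical
helper names.

```python
def pack_bits(bit_string):
    """Pack a string of '0'/'1' characters into bytes, 8 bits per byte."""
    # Pad on the right with zeros up to a multiple of 8 bits.
    padded = bit_string.ljust((len(bit_string) + 7) // 8 * 8, "0")
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


def unpack_bits(data, num_bits):
    """Recover the first num_bits characters of the original bit string."""
    return "".join(f"{byte:08b}" for byte in data)[:num_bits]


bit_string = "1010000001" + "0" * 990   # stand-in for the reducer output
packed = pack_bits(bit_string)

print(len(bit_string), "characters ->", len(packed), "bytes")
assert unpack_bits(packed, len(bit_string)) == bit_string
```

A query script that reads the packed form would need to call `unpack_bits` (or test individual bits)
before checking positions.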

3. Bloom Filter Query Implementation

Once the Bloom Filter is built, you can query it to check if an element might be in the set.

- Query Code (bloom_query.py):

```python
#!/usr/bin/env python3
import sys
import hashlib


# Function to compute hash positions for querying (must match the mapper)
def get_hashes(item, num_hashes, bit_array_size):
    hashes = []
    for i in range(num_hashes):
        hash_value = int(hashlib.md5(f'{item}{i}'.encode()).hexdigest(), 16)
        hashes.append(hash_value % bit_array_size)
    return hashes


# Parameters
NUM_HASHES = 3         # Number of hash functions
BIT_ARRAY_SIZE = 1000  # Size of bit array

# Input element to check
if len(sys.argv) != 2:
    sys.exit('Usage: bloom_query.py ELEMENT < BIT_ARRAY_FILE')
element_to_check = sys.argv[1]

# Read the Bloom Filter bit array from input
bloom_filter = sys.stdin.read().strip()
if len(bloom_filter) != BIT_ARRAY_SIZE:
    sys.exit(f'Expected {BIT_ARRAY_SIZE} bits, got {len(bloom_filter)}')

# Compute hash positions for the element to check
hash_positions = get_hashes(element_to_check, NUM_HASHES, BIT_ARRAY_SIZE)

# Check if all the corresponding bits in the Bloom Filter are set to 1
is_present = all(bloom_filter[pos] == '1' for pos in hash_positions)

if is_present:
    print(f'{element_to_check} is possibly in the set.')
else:
    print(f'{element_to_check} is definitely not in the set.')
```
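
How often "possibly in the set" turns out to be wrong depends on the filter parameters. A standard
approximation for the false-positive probability after inserting `n` elements into `m` bits with `k`
hash functions is `(1 - e^(-kn/m))^k`, assuming the hash functions behave like independent uniform
random functions. The sketch below evaluates it for the parameters used in this experiment. The
element counts are made-up examples, and `false_positive_rate` is a hypothetical helper name.

```python
import math


def false_positive_rate(n, m, k):
    """Approximate false-positive probability: (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-k * n / m)) ** k


M, K = 1000, 3   # BIT_ARRAY_SIZE and NUM_HASHES from the scripts above
for n in (50, 100, 200, 500):
    rate = false_positive_rate(n, M, K)
    print(f"n={n:4d}  estimated false-positive rate = {rate:.3f}")
```

Under this approximation the rate is about 1.7% for 100 elements but about 47% for 500, so a
1000-bit array is only suitable for small sets. Larger inputs need a proportionally larger
`BIT_ARRAY_SIZE`.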

Conclusion:

Implementing a Bloom Filter using the MapReduce framework allows for efficient parallel processing
and distributed storage of large datasets. The Bloom Filter is a valuable tool when it is important to
perform membership queries with low memory overhead, such as in web caches, databases, and
spam filters.
