Bda Exp8

The document outlines the Flajolet-Martin Algorithm, a probabilistic method for counting distinct elements in stream data, which utilizes a hash function to map data to binary strings and estimates unique counts based on trailing zeros. It details the algorithm's steps, accuracy factors, and provides an example with code implementation in Python. The conclusion emphasizes the algorithm's effectiveness in handling large datasets and the impact of varying hash functions and stream sizes on distinct count estimation.

Uploaded by

jpurva23ecs

Experiment No-08

Aim: To implement the Flajolet-Martin algorithm for counting distinct elements in stream data.

Prerequisite: Ensure that a Python IDE is installed, configured, and running.

Theory:

The Flajolet-Martin algorithm is a probabilistic algorithm used mainly to count the number of unique elements in a stream
or database. It was invented by Philippe Flajolet and G. Nigel Martin in 1983 and has since been used in various
applications such as data mining and database management.

The basic idea behind the Flajolet-Martin algorithm is to use a hash function to map the elements in the given dataset to binary strings,
and to use the length of the longest run of trailing zeros found in those binary strings as an estimator for the number of unique elements.

The steps for the Flajolet-Martin algorithm are:

• Choose a hash function that maps the elements in the dataset to fixed-length binary strings. The length of
the binary string can be chosen based on the accuracy desired.

• Apply the hash function to each data item in the dataset to get its binary string representation.

• Determine the number of trailing zeros in each binary string, i.e. the number of zeros to the right of the rightmost 1.

• Compute the maximum number of trailing zeros over all binary strings.

• Estimate the number of distinct elements in the dataset as 2 raised to the power of the maximum computed in the
previous step.
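
The steps above can be sketched in Python as follows. This is a minimal illustration, not the program used in this experiment: the helper names `to_bits` and `estimate_distinct`, the use of SHA-256 from the standard `hashlib` module, and the 32-bit string length are all assumptions made for the example. Trailing zeros are counted as in the tail length defined in the Algorithm section.

```python
import hashlib

BITS = 32  # assumed length of the fixed-length binary strings


def to_bits(item):
    """Map an item to a fixed-length binary string (assumed hash: SHA-256)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    value = int.from_bytes(digest[:BITS // 8], "big")
    return format(value, f"0{BITS}b")


def trailing_zeros(bits):
    """Number of zeros at the right end of the string (0 for an all-zero string)."""
    stripped = bits.rstrip("0")
    return len(bits) - len(stripped) if stripped else 0


def estimate_distinct(items):
    """Return 2 ** (largest trailing-zero count seen over all items)."""
    max_zeros = 0
    for item in items:
        max_zeros = max(max_zeros, trailing_zeros(to_bits(item)))
    return 2 ** max_zeros


data = ["apple", "banana", "apple", "cherry", "banana"]
print(estimate_distinct(data))
```

Because only the maximum is kept, repeated items never change the result: a duplicate hashes to the same binary string as its first occurrence.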

The accuracy of the Flajolet-Martin algorithm is determined by the length of the binary strings and the number of hash functions used. Generally,
increasing the length of the binary strings or using more hash functions improves the algorithm's accuracy.
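
One way to use several hash functions is to run the estimator once per hash function and combine the results. The sketch below is an illustration under assumptions: the hash family (SHA-256 salted with an index), the group sizes, and the median-of-averages combining rule are choices made for this example, not something this experiment specifies.

```python
import hashlib
from statistics import mean, median


def salted_hash(item, salt):
    """Assumed hash family: first 32 bits of SHA-256 over 'salt:item'."""
    digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def tail_length(value):
    """Trailing zero bits of a non-negative integer (0 for value 0)."""
    if value == 0:
        return 0
    return (value & -value).bit_length() - 1


def single_estimate(items, salt):
    """Estimate 2**R using one hash function, identified by its salt."""
    return 2 ** max((tail_length(salted_hash(x, salt)) for x in items), default=0)


def combined_estimate(items, groups=5, per_group=4):
    """Average the estimates within each group, then take the median of the averages."""
    items = list(items)
    averages = []
    for g in range(groups):
        salts = range(g * per_group, (g + 1) * per_group)
        averages.append(mean(single_estimate(items, s) for s in salts))
    return median(averages)


stream = list(range(500)) * 2  # 500 distinct values, each seen twice
print("single hash function:", single_estimate(stream, 0))
print("20 hash functions   :", combined_estimate(stream))
```

A single estimate is always a power of two, so it can be off by a large factor. Averaging within groups smooths this out, and taking the median across groups limits the influence of one unusually large estimate.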

The Flajolet-Martin algorithm is especially useful for big datasets that cannot be kept in memory or analysed with regular methods. By using
probabilistic techniques, it can provide a reasonable estimate of the number of unique elements in the dataset with very little memory and
computation.

Intuition:-

If we had a good, random hash function that acted on strings and generated integers, what could we say about the generated integers? Since they are themselves random, we would expect:

1. 1/2 of them to have a binary representation ending in 0 (i.e. divisible by 2),

2. 1/4 of them to have a binary representation ending in 00 (i.e. divisible by 4),

3. 1/8 of them to have a binary representation ending in 000 (i.e. divisible by 8),

and in general, 1/2^n of them to have a binary representation ending in n zeros.

Turning the problem around, if the hash function generated an integer ending in m zeros (and it also generated integers ending in m−1 zeros, m−2 zeros, ..., 1 zero), then intuitively the number of unique strings is around 2^m.
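
These fractions can be checked empirically. The sketch below draws pseudo-random 32-bit integers with Python's `random` module as a stand-in for hash values (an assumption; the seed and sample size are arbitrary) and prints the share ending in at least n zero bits next to 1/2^n.

```python
import random

random.seed(0)  # arbitrary seed, only to make runs repeatable
SAMPLES = 100_000
values = [random.getrandbits(32) for _ in range(SAMPLES)]


def fraction_ending_in_zeros(nums, n):
    """Share of nums whose binary form ends in at least n zeros (divisible by 2**n)."""
    return sum(1 for v in nums if v % (2 ** n) == 0) / len(nums)


for n in range(1, 6):
    observed = fraction_ending_in_zeros(values, n)
    print(f"n={n}: observed {observed:.4f}, expected {1 / 2 ** n:.4f}")
```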

To facilitate the above, the algorithm maintains 1 bit for each run of trailing zeros seen, i.e. 1 bit for 0, another for 00, another for 000, and so on. The output of the algorithm is based on the maximum number of consecutive trailing zeros seen.
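
A minimal sketch of this bookkeeping, assuming the bits are stored in a single Python integer and the hash function is supplied by the caller (the class and method names are illustrative). Bit i is set when a hash value with exactly i trailing zeros has been seen, and a hash value of 0 is given tail length 0 by assumption.

```python
class TailBitmap:
    """Keeps one bit per tail length seen; names are illustrative."""

    def __init__(self, hash_fn):
        self.hash_fn = hash_fn
        self.bitmap = 0  # bit i set <=> a hash with exactly i trailing zeros was seen

    def add(self, item):
        h = self.hash_fn(item)
        # Lowest set bit gives the tail length; h == 0 is treated as tail length 0.
        tail = (h & -h).bit_length() - 1 if h else 0
        self.bitmap |= 1 << tail

    def max_tail(self):
        """Largest tail length recorded so far (0 if nothing was recorded)."""
        return max(self.bitmap.bit_length() - 1, 0)

    def estimate(self):
        return 2 ** self.max_tail()


# Hash function and stream taken from the Example section of this report.
sketch = TailBitmap(lambda x: (3 * x + 1) % 5)
for value in [10, 12, 13, 3, 4, 25, 7, 10, 4]:
    sketch.add(value)
print(bin(sketch.bitmap), sketch.estimate())  # 0b11 2
```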

Algorithm:

1. Pick a hash function h that maps each of the n elements to at least log2(n) bits.

2. For each stream element a, let r(a) be the number of trailing 0's in h(a). This is called the tail length. Example: 000101 has tail length 0; 101000 has tail length 3.

3. Record R = the maximum r(a) seen for any a in the stream.

4. Estimate (based on this hash function) = 2^R.
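
The algorithm above translates almost line for line into code. The following is a sketch with the hash function passed in as a parameter; `tail_length` and `flajolet_martin` are names chosen for this illustration, and a hash value of 0 is given tail length 0 by assumption.

```python
def tail_length(h):
    """r(a): number of trailing 0 bits in the hash value h."""
    if h == 0:
        return 0  # assumption: an all-zero hash counts as tail length 0
    count = 0
    while h % 2 == 0:
        h //= 2
        count += 1
    return count


def flajolet_martin(stream, h):
    """Single pass over the stream; only R is kept in memory."""
    R = 0
    for a in stream:
        R = max(R, tail_length(h(a)))
    return 2 ** R


print(tail_length(0b000101))  # 0, as in the example above
print(tail_length(0b101000))  # 3, as in the example above
```

A call such as `flajolet_martin(stream, lambda x: x % 11)` then gives the estimate for one particular hash function. The stream can be any iterable, including a generator that is too large to hold in memory.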


Example:-

a. Consider the stream = {10, 12, 13, 3, 4, 25, 7, 10, 4}.

b. Let the hash function be h(x) = (3x + 1) % 5.

c. Calculate the hash value of each stream element and its binary representation.

i. h(10) = (3(10)+1)%5 = 31%5 = 1 (001)

ii. h(12) = (3(12)+1)%5 = 37%5 = 2 (010)

iii. h(13) = (3(13)+1)%5 = 40%5 = 0 (000)

iv. h(3) = (3(3)+1)%5 = 10%5 = 0 (000)

v. h(4) = (3(4)+1)%5 = 13%5 = 3 (011)

vi. h(25) = (3(25)+1)%5 = 76%5 = 1 (001)

vii. h(7) = (3(7)+1)%5 = 22%5 = 2 (010)

viii. h(10) = (3(10)+1)%5 = 31%5 = 1 (001)

ix. h(4) = (3(4)+1)%5 = 13%5 = 3 (011)

d. Trailing zeros r(x) = {0, 1, 0, 0, 0, 0, 1, 0, 0} (the all-zero value 000 is assigned tail length 0).

e. R = max r(x) = 1

f. Distinct count = 2^R = 2^1 = 2
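
The hand calculation can be reproduced with a few lines of Python. This is an illustrative check rather than the experiment's program; the 3-bit formatting matches the binary strings written above, and a hash value of 000 is given tail length 0, as in step d.

```python
stream = [10, 12, 13, 3, 4, 25, 7, 10, 4]


def h(x):
    """Hash function from step b."""
    return (3 * x + 1) % 5


def tail_length(bits):
    """Trailing zeros of a binary string; all-zero strings count as 0 (as in step d)."""
    stripped = bits.rstrip("0")
    return len(bits) - len(stripped) if stripped else 0


tails = []
for x in stream:
    bits = format(h(x), "03b")  # 3-bit binary string, e.g. 2 -> "010"
    tails.append(tail_length(bits))
    print(f"h({x}) = {h(x)} ({bits}), tail length {tails[-1]}")

R = max(tails)
print("R =", R, "estimate =", 2 ** R)              # R = 1 estimate = 2
print("exact distinct count =", len(set(stream)))  # 7
```

The exact number of distinct elements in this stream is 7, so this hash function, which can only produce five different values, underestimates noticeably.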

Code:

import sys

import pandas as pd


def hash_function(x):
    """Simple modulo hash function."""
    return x % 11  # Modify if needed


def count_trailing_zeros(binary_str):
    """Count trailing zeros in a binary string (an all-zero string counts as 0)."""
    stripped = binary_str.rstrip('0')
    # A hash value of 0 gives the string "0"; treat its tail length as 0,
    # matching the worked example.
    return len(binary_str) - len(stripped) if stripped else 0


# User input for stream size
stream_size = int(input("Enter the stream size: "))

# Taking input for stream elements
stream_data = list(map(int, input("Enter the stream elements separated by space: ").split()))

# Check if the entered elements match the stream size
if len(stream_data) != stream_size:
    print("Error: The number of elements does not match the specified stream size.")
    sys.exit(1)

# Process stream: track the largest trailing-zero count seen so far
results = []
max_trailing_zeros = 0

for num in stream_data:
    h_value = hash_function(num)
    binary_value = bin(h_value)[2:]  # Convert to binary string
    trailing_zeros = count_trailing_zeros(binary_value)
    max_trailing_zeros = max(max_trailing_zeros, trailing_zeros)
    results.append([num, h_value, binary_value, trailing_zeros])

# Estimate final count
estimated_count = 2 ** max_trailing_zeros

# Display results in table format
df = pd.DataFrame(results, columns=["Stream Value", "Hash Value", "Binary", "Trailing Zeros"])
print(df)
print(f"\nEstimated number of distinct elements: {estimated_count}")

Output:

1] Hash function x % 11 with stream size 6

With stream size 10

With stream size 15

2] Hash function x % 15 with stream size 6

Different hash function (x * 17) % 31 with the same stream

Different hash function (x * 2) % 30

B.4 Conclusion: Hence, we learned about the Flajolet-Martin algorithm for counting distinct elements in stream data.

Variation 1: with the same hash function, changing the stream size left the estimated distinct count unchanged.

Variation 2: for the same stream, different hash functions give different estimates, so an exact distinct count cannot be obtained from a single hash function; using two hash functions and combining their estimates is preferable.
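
To make variation 2 concrete, the sketch below applies the four hash functions named in the output section to the stream from the Example section. The choice of stream is an assumption for illustration (the streams used for the recorded outputs are not listed here), and a hash value of 0 is given tail length 0.

```python
stream = [10, 12, 13, 3, 4, 25, 7, 10, 4]  # stream from the Example section

hash_functions = {
    "x % 11": lambda x: x % 11,
    "x % 15": lambda x: x % 15,
    "(x * 17) % 31": lambda x: (x * 17) % 31,
    "(x * 2) % 30": lambda x: (x * 2) % 30,
}


def tail_length(h):
    """Trailing zero bits of h (0 for h == 0, by assumption)."""
    return (h & -h).bit_length() - 1 if h else 0


def estimate(values, hash_fn):
    """2 ** (maximum tail length over the hashed values)."""
    return 2 ** max(tail_length(hash_fn(v)) for v in values)


estimates = {name: estimate(stream, fn) for name, fn in hash_functions.items()}
for name, value in estimates.items():
    print(f"{name:>14}: {value}")
print("average:", sum(estimates.values()) / len(estimates))
print("exact  :", len(set(stream)))
```

On this stream the four estimates are 4, 4, 4 and 8 against an exact count of 7, which illustrates why the estimates from several hash functions are usually combined rather than trusted individually.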
