Bda Exp8

The document outlines the Flajolet-Martin Algorithm, a probabilistic method for counting distinct elements in stream data, which utilizes a hash function to map data to binary strings and estimates unique counts based on trailing zeros. It details the algorithm's steps, accuracy factors, and provides an example with code implementation in Python. The conclusion emphasizes the algorithm's effectiveness in handling large datasets and the impact of varying hash functions and stream sizes on distinct count estimation.

Uploaded by

jpurva23ecs


Experiment No-08

Aim: To implement Flajolet-Martin Algorithm for counting distinct elements in Stream Data.

Prerequisite: Ensure that a Python IDE is installed, configured and running.

Theory:

The Flajolet-Martin algorithm is a probabilistic algorithm used to estimate the number of unique elements in a stream or database.
It was introduced by Philippe Flajolet and G. Nigel Martin in 1983 and has since been used in applications such as data mining and
database management.

The basic idea of the Flajolet-Martin algorithm is to use a hash function to map each element of the dataset to a binary string,
and to use the length of the longest run of trailing zeros observed in those binary strings as an estimator for the number of
unique elements.

The steps for the Flajolet-Martin algorithm are:

• First, choose a hash function that maps the elements in the database to fixed-length binary strings. The length of
the binary string can be chosen based on the accuracy desired.

• Next, apply the hash function to each data item in the dataset to get its binary string representation.

• Next, determine the number of trailing zeros in each binary string (equivalently, the position of its rightmost 1 bit).

• Next, compute the maximum number of trailing zeros over all binary strings.

• Finally, estimate the number of distinct elements in the dataset as 2 to the power of the maximum number of trailing zeros
computed in the previous step.

The accuracy of the Flajolet-Martin algorithm is determined by the length of the binary strings and the number of hash functions it uses.
Generally, increasing the length of the binary strings or using more hash functions improves the accuracy of the estimate.
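To illustrate how several hash functions can be combined, here is a minimal sketch. It is not part of the original experiment: the salted SHA-256 hash family, the helper names (`salted_hash`, `fm_combined_estimate`) and the parameter values are assumptions made for illustration. The per-hash estimates 2^R are averaged within small groups and the median of the group averages is returned, which is one common way to damp the effect of a single unlucky hash function.

```python
import hashlib
from statistics import mean, median


def tail_length(value: int) -> int:
    """Number of trailing zero bits; the value 0 is treated as tail length 0 (a convention)."""
    if value == 0:
        return 0
    return (value & -value).bit_length() - 1


def salted_hash(item, salt: int, bits: int = 32) -> int:
    """Hypothetical hash family: SHA-256 of 'salt:item', truncated to `bits` bits."""
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") & ((1 << bits) - 1)


def fm_combined_estimate(stream, num_hashes: int = 32, group_size: int = 4) -> float:
    """Median of the group means of the per-hash estimates 2^R."""
    max_tails = [0] * num_hashes  # one running maximum R per hash function
    for item in stream:
        for k in range(num_hashes):
            max_tails[k] = max(max_tails[k], tail_length(salted_hash(item, k)))
    estimates = [2 ** r for r in max_tails]
    groups = [estimates[i:i + group_size] for i in range(0, num_hashes, group_size)]
    return median([mean(group) for group in groups])


print(fm_combined_estimate(range(1000)))
```

Only one integer per hash function is kept in memory, so the cost grows with the number of hash functions, not with the stream length.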

The Flajolet-Martin algorithm is especially useful for big datasets that cannot be kept in memory or analysed with regular methods.
By using probabilistic techniques, it can provide a reasonable estimate of the number of unique elements in the dataset with
little computation and memory.

Intuition:-

If we had a good, random hash function that acted on strings and generated integers, what can we say about the generated
integers? Since they are random themselves, we would expect:

1. 1/2 of them to have a binary representation ending in 0 (i.e. divisible by 2),

2. 1/4 of them to have a binary representation ending in 00 (i.e. divisible by 4),

3. 1/8 of them to have a binary representation ending in 000 (i.e. divisible by 8),

and in general, 1/2^n of them to have a binary representation ending in 0^n (n consecutive zeros).

Turning the problem around, if the hash function generated an integer ending in 0^m (and it also generated integers ending in
0^(m−1), 0^(m−2), ..., 0^1), then intuitively the number of unique strings is around 2^m.

To facilitate the above, the algorithm maintains 1 bit for each 0^i seen, i.e. 1 bit for 0, another for 00, another for 000,
and so on. The output of the algorithm is based on the largest i such that 0^1, 0^2, ..., 0^i have all been seen.
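The 1/2, 1/4, 1/8 pattern can be checked directly. The sketch below is an illustration only: it enumerates every 16-bit integer once, as a stand-in for the output of an ideal uniform hash function (the 16-bit width and the name `has_tail` are assumptions), and measures the fraction of values ending in at least n zeros.

```python
def has_tail(value: int, n: int) -> bool:
    """True if the binary form of value ends in at least n zeros, i.e. value is divisible by 2^n."""
    return value % (1 << n) == 0


BITS = 16                  # assumed hash width for this illustration
values = range(1 << BITS)  # every 16-bit value once, standing in for uniform hash output

fractions = {
    n: sum(has_tail(v, n) for v in values) / len(values)
    for n in range(1, 5)
}

for n, frac in fractions.items():
    print(f"ends in at least {n} zero(s): {frac}  (1/2^{n} = {1 / 2 ** n})")
```

Because half of the values are even, a quarter are divisible by 4, and so on, seeing a tail of length m typically requires about 2^m distinct hash values, which is the reasoning behind the 2^m estimate.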

Algorithm:

1. Pick a hash function h that maps each of the n elements to at least log2(n) bits.

2. For each stream element a, let r(a) be the number of trailing 0’s in h(a). This is called the tail length.
Example: 000101 has tail length 0; 101000 has tail length 3.

3. Record R = the maximum r(a) seen for any a in the stream.

4. Estimate (based on this hash function) = 2^R.
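The algorithm above can be sketched in a few lines. This is a minimal illustration, not the code used in the experiment: the function names are chosen here, the hash function h is passed in by the caller, and a hash value of 0 is treated as tail length 0 by assumption. The tail length is computed on integers with a bit trick instead of on binary strings.

```python
from typing import Callable, Iterable


def tail_length(value: int) -> int:
    """r(a): number of trailing zero bits of a hash value (0 is treated as tail length 0)."""
    if value == 0:
        return 0
    # value & -value isolates the lowest set bit; its bit position is the tail length.
    return (value & -value).bit_length() - 1


def fm_estimate(stream: Iterable[int], h: Callable[[int], int]) -> int:
    """Single-hash Flajolet-Martin estimate 2^R, where R is the maximum tail length seen."""
    R = 0
    for a in stream:
        R = max(R, tail_length(h(a)))
    return 2 ** R


print(tail_length(0b000101), tail_length(0b101000))  # 0 3
```

Only the single integer R is stored while the stream is read, so the stream itself never has to fit in memory.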

Example:-

a. Consider the stream = {10,12,13,3,4,25,7,10,4}

b. Let the hash function be h(x) = (3x+1)%5.

c. Calculate the hash value of each stream element and its binary representation.

i. h(10) = (3(10)+1)%5 = 31%5 = 1 (001)

ii. h(12) = (3(12)+1)%5 = 37%5 = 2 (010)

iii. h(13) = (3(13)+1)%5 = 40%5 = 0 (000)

iv. h(3) = (3(3)+1)%5 = 10%5 = 0 (000)

v. h(4) = (3(4)+1)%5 = 13%5 = 3 (011)

vi. h(25) = (3(25)+1)%5 = 76%5 = 1 (001)

vii. h(7) = (3(7)+1)%5 = 22%5 = 2 (010)

viii. h(10) = (3(10)+1)%5 = 31%5 = 1 (001)

ix. h(4) = (3(4)+1)%5 = 13%5 = 3 (011)

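d. Tail lengths r(a) = {0,1,0,0,0,0,1,0,0} (the all-zero hash 000 is counted as tail length 0)

e. R = max r(a) = 1

f. Estimated distinct count = 2^R = 2^1 = 2 (the stream actually contains 7 distinct values; the estimate is low because h only takes the values 0 to 4)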
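The worked example can be reproduced with a short script. This is a sketch for checking the arithmetic; the variable names are chosen here, and the rule that an all-zero string has tail length 0 is taken from the tail lengths listed in step d.

```python
stream = [10, 12, 13, 3, 4, 25, 7, 10, 4]


def h(x: int) -> int:
    """Hash function from the example: h(x) = (3x + 1) % 5."""
    return (3 * x + 1) % 5


def tail_length(bits: str) -> int:
    """Trailing zeros of a binary string; an all-zero string counts as 0, as in step d."""
    if "1" not in bits:
        return 0
    return len(bits) - len(bits.rstrip("0"))


# (element, hash value, 3-bit binary representation)
rows = [(x, h(x), format(h(x), "03b")) for x in stream]
tails = [tail_length(bits) for _, _, bits in rows]
R = max(tails)
estimate = 2 ** R
true_count = len(set(stream))

for (x, hv, bits), r in zip(rows, tails):
    print(f"h({x}) = {hv} ({bits}), tail length {r}")
print(f"R = {R}, estimate = {estimate}, true distinct count = {true_count}")
```

The script gives R = 1 and an estimate of 2 against a true distinct count of 7, which shows how much a hash function with a very small range limits the estimate.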

Code:

import sys

import pandas as pd


def hash_function(x):
    """Simple modulo hash; change the modulus or formula to try other hash functions."""
    return x % 11


def count_trailing_zeros(binary_str):
    """Count trailing zeros in a binary string (an all-zero string counts as 0)."""
    if "1" not in binary_str:
        return 0
    return len(binary_str) - len(binary_str.rstrip("0"))


# User input for stream size
stream_size = int(input("Enter the stream size: "))

# Taking input for stream elements
stream_data = list(map(int, input("Enter the stream elements separated by space: ").split()))

# Check if the entered elements match the stream size
if len(stream_data) != stream_size:
    print("Error: The number of elements does not match the specified stream size.")
    sys.exit(1)

# Process stream: hash each element and track the longest tail of zeros
results = []
max_trailing_zeros = 0
for num in stream_data:
    h_value = hash_function(num)
    binary_value = bin(h_value)[2:]  # Convert to binary string
    trailing_zeros = count_trailing_zeros(binary_value)
    max_trailing_zeros = max(max_trailing_zeros, trailing_zeros)
    results.append([num, h_value, binary_value, trailing_zeros])

# Estimate final count
estimated_count = 2 ** max_trailing_zeros

# Display results in table format
df = pd.DataFrame(results, columns=["Stream Value", "Hash Value", "Binary", "Trailing Zeros"])
print(df)
print(f"\nEstimated number of distinct elements: {estimated_count}")

Output:

1] Hash function x % 11, with stream sizes 6, 10 and 15 [output screenshots not preserved]

2] Stream size 6, with hash functions x % 15, (x * 17) % 31 and (x * 2) % 30 [output screenshots not preserved]

Conclusion: Hence, we learned about the Flajolet-Martin algorithm for counting distinct elements in stream data.

Variation 1: with the same hash function, changing the stream size did not change the estimated distinct count.

Variation 2: for the same stream, different hash functions give different estimates, so a single hash function cannot give an exact distinct count; the method that uses two hash functions should be preferred.
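Both variations can be tried with a short script. This is a sketch under assumptions: the original runs' input streams are not available, so the stream from the worked example is reused, and the stream is simply repeated to mimic a larger stream size with the same distinct values. The four hash functions are the ones named in the output section.

```python
stream = [10, 12, 13, 3, 4, 25, 7, 10, 4]  # assumed input: the stream from the worked example

hash_functions = {
    "x % 11": lambda x: x % 11,
    "x % 15": lambda x: x % 15,
    "(x * 17) % 31": lambda x: (x * 17) % 31,
    "(x * 2) % 30": lambda x: (x * 2) % 30,
}


def tail_length(value: int) -> int:
    """Number of trailing zero bits; 0 is treated as tail length 0 (a convention)."""
    if value == 0:
        return 0
    return (value & -value).bit_length() - 1


def fm_estimate(values, h) -> int:
    """Single-hash estimate 2^R for the given stream and hash function."""
    return 2 ** max((tail_length(h(v)) for v in values), default=0)


# Variation 2: same stream, different hash functions
estimates = {name: fm_estimate(stream, h) for name, h in hash_functions.items()}

# Variation 1: same hash function, longer stream made of the same distinct values
repeated = {name: fm_estimate(stream * 2, h) for name, h in hash_functions.items()}

for name in hash_functions:
    print(f"{name}: estimate {estimates[name]}, with repeated stream {repeated[name]}")
print(f"True distinct count: {len(set(stream))}")
```

On this stream three of the hash functions estimate 4 and the last estimates 8, while the true count is 7. Repeating elements never changes an estimate, because duplicates hash to the same value and cannot raise the maximum tail length.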
