
Module 4

A Bloom filter is a probabilistic data structure used in Big Data Analytics to efficiently test set membership, allowing for space-saving representation of large datasets while guaranteeing no false negatives. Applications include spam email checking, URL authentication, and filtering seen posts on platforms like Medium and Quora. By using multiple hash functions and a bit array, Bloom filters can manage large sets with a controlled probability of false positives, making them suitable for scenarios where memory is limited.

Uploaded by

Biya Rahul

Questions:

How is a Bloom filter useful in Big Data Analytics? Explain with a suitable
example.
Bloom Filter: A Bloom filter is a space-efficient probabilistic data
structure with the following characteristics:

• It is used to test whether an element is a member of a set.

• False positive matches are possible, but false negatives are not, thus a
Bloom filter has a 100% recall rate.

• A query returns either “possibly in set” or “definitely not in set”.
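The behaviour described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used by any particular system: the class name `BloomFilter`, the method names, and the choice of salted SHA-256 digests as the k hash functions are all assumptions made for this example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter with m bits and k hash functions (illustrative only)."""

    def __init__(self, m, k):
        self.m = m                 # number of bits in the array
        self.k = k                 # number of hash functions
        self.bits = bytearray(m)   # one byte per bit, kept simple for readability

    def _positions(self, item):
        # Assumption: derive k positions by salting SHA-256 with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False -> "definitely not in set"; True -> "possibly in set".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1000, k=3)
for word in ["apple", "banana", "cherry"]:
    bf.add(word)

print(bf.might_contain("apple"))   # True: an inserted item is never missed
print(bf.might_contain("durian"))  # usually False; True would be a false positive
```

Note that `might_contain` can only answer "possibly" or "definitely not"; it never stores the items themselves, which is where the space saving comes from.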
Unique properties of Bloom filters:

1. Unlike a standard hash table, a Bloom filter of fixed size can represent a set
with an arbitrarily large number of elements; it never reaches a “filled up”
state. However, the probability of false positives increases as more entries
are added, until all the bits are set to 1, after which every query returns a
positive result.

2. It is not possible to delete an element from a standard Bloom filter: if we clear
the bits at the indices generated by the hash functions for a single element, bits
that other elements rely on may also be cleared.

3. Bloom filters may produce false positives but never false negatives. A filter
may report that an entry is present when it is not in the set (false positive),
but it never reports that an entry is absent when it is in the set (false negative).
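Properties 1 and 2 can be observed on a deliberately tiny filter. The sketch below is a demonstration under assumed values (64 bits, 2 hash functions built from salted SHA-256); the helper names `positions`, `add` and `might_contain` are hypothetical and chosen only for this example.

```python
import hashlib
import itertools

M, K = 64, 2  # deliberately tiny filter (assumed values) so collisions are easy to find


def positions(item):
    """Return the K bit positions for item, from salted SHA-256 digests."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % M
        for i in range(K)
    ]


def add(bits, item):
    for p in positions(item):
        bits[p] = 1


def might_contain(bits, item):
    return all(bits[p] for p in positions(item))


# Property 2: find a second item that shares at least one bit with "alice".
alice_bits = set(positions("alice"))
other = next(
    name for name in (f"user-{i}" for i in itertools.count())
    if alice_bits & set(positions(name))
)

bits = [0] * M
add(bits, "alice")
add(bits, other)
for p in positions("alice"):      # naive "delete": clear alice's bits
    bits[p] = 0
deletion_broke_other = not might_contain(bits, other)
print(f"{other!r} lost after deleting 'alice':", deletion_broke_other)

# Property 1: the filter never fills up, but false positives grow with load.
bits = [0] * M
probes = [f"outsider-{j}" for j in range(1000)]  # these are never inserted
rates = []
inserted = 0
for target in (10, 50, 500):
    while inserted < target:
        add(bits, f"member-{inserted}")
        inserted += 1
    rate = sum(might_contain(bits, q) for q in probes) / len(probes)
    rates.append(rate)
    print(f"{inserted} items, {sum(bits)}/{M} bits set, false positive rate ~ {rate:.2f}")
```

Clearing the shared bit turns the other item into a false negative, which is exactly what a Bloom filter must never produce; this is why plain Bloom filters do not support deletion.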
Bloom Filtering
• Used when membership must be tested against a set or list and space is an issue
• It selects tuples that satisfy the given criteria; the others are rejected

• Two categories of selection criteria:

• 1) The property of the tuple can be calculated from the tuple itself:
• E.g. from a continuous stream of feeds on Twitter, search for a specific word
or phrase
• 2) Set membership: check whether an element of the incoming stream belongs
to a given set, e.g. URL authentication, email-id generation, spam emails
• Category 2 is hard: what if the set is too large to fit in memory?
Remember!

• Bloom filters may give false positives but never
yield false negatives.

• This means a query has only two outcomes: surely not present (true negative) or
may be present (true positive or false positive).
Applications of Bloom filters
• Medium uses Bloom filters when recommending posts to users, to
filter out posts the user has already seen.
• Quora implemented a shared Bloom filter in the feed back-end to
filter out stories that people have seen before.
• The Google Chrome web browser used to use a Bloom filter to
identify malicious URLs.
• Google Bigtable, Apache HBase, Apache Cassandra, and
PostgreSQL use Bloom filters to reduce disk lookups for
non-existent rows or columns.
Example of spam email check
• S -> a set of one billion safe and trusted email addresses, each of size 20 bytes (about 20 GB
in total); the stream consists of pairs (email address, email message)
• S is too large to be accommodated in main memory, and if disk is used to store these
addresses, testing membership in S for every stream element will be too slow
• So use only main memory.
• Assume 1 GB of memory (8 billion bits), which can be used as a bit array of 8 billion locations
• Use a hash function to map each email address in S to one of the 8 billion locations in the array
• All locations to which addresses in S are mapped are set to 1; the other locations remain 0
• With 1 billion email addresses in S and 8 billion bit locations in main memory, at most
1/8th of the available bits will be set to 1 (slightly fewer, since some addresses collide)
• When a new stream element arrives, hash its email address and check the bit location(s) to
which it is hashed
• If every such bit is 1, the email is treated as coming from a safe and trusted sender (with a
small chance of a false positive). If any bit is 0, the address is definitely not in S and the
email is treated as spam
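A scaled-down version of this example can be run to check the arithmetic. The sketch below assumes 20,000 trusted addresses instead of one billion while keeping the same ratio of 8 bits per address, and it uses a single hash function taken from SHA-256. The address patterns and helper names are invented for illustration only.

```python
import hashlib
import math

N_TRUSTED = 20_000        # stand-in for the 1 billion addresses in S (assumption)
M_BITS = 8 * N_TRUSTED    # same 8-bits-per-address ratio as 1 GB for 1 billion

bit_array = bytearray(M_BITS // 8)  # packed bit array, all bits initially 0


def bit_index(email):
    """Single hash function: map an address to one of the M_BITS locations."""
    digest = hashlib.sha256(email.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % M_BITS


def set_bit(i):
    bit_array[i // 8] |= 1 << (i % 8)


def get_bit(i):
    return (bit_array[i // 8] >> (i % 8)) & 1


def is_probably_trusted(email):
    # 1 -> possibly in S (let the mail through); 0 -> definitely not in S (spam).
    return get_bit(bit_index(email)) == 1


# Build the filter from the trusted set S.
for i in range(N_TRUSTED):
    set_bit(bit_index(f"trusted{i}@example.com"))

ones = sum(bin(byte).count("1") for byte in bit_array)
fraction_set = ones / M_BITS
expected_fraction = 1 - math.exp(-N_TRUSTED / M_BITS)  # 1 - e^(-1/8)

# Addresses that are not in S: any that pass are false positives.
n_spam = 20_000
passed = sum(is_probably_trusted(f"spammer{j}@example.net") for j in range(n_spam))

print(f"fraction of bits set: {fraction_set:.4f} (expected about {expected_fraction:.4f})")
print(f"fraction of unknown senders let through: {passed / n_spam:.4f}")
```

With one hash function the fraction of bits set is a little under 1/8, and roughly the same fraction of unknown senders slips through as false positives; every address in S always passes.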
Example
A “definitely not in set” answer can never be a false negative (it is a true negative).
The design parameters of Bloom filters:
Size of Bloom filter:
If m is the number of bits in the array,
k is the number of hash functions and
n is the number of elements expected to be inserted in the filter,
the probability P of a false positive turns out to be approximately

$$P \approx \left(1 - e^{-kn/m}\right)^{k}$$

Depending on the false positive rate our application can tolerate, and if we know approximately
how many elements are going to be inserted in the filter, we can control this probability.

So by choosing different values for k and m we can set the size of the filter and its probability of generating false
positives. The larger the filter, the lower the probability of false positives.
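The effect of m, k and n can be explored numerically. The sketch below uses the standard approximation P ≈ (1 - e^(-kn/m))^k, and for sizing it uses the commonly quoted formula m = -n ln P / (ln 2)^2, which assumes the optimal number of hash functions is used. The function names are hypothetical.

```python
import math


def false_positive_probability(m, k, n):
    """Approximate false positive rate for m bits, k hash functions, n elements."""
    return (1 - math.exp(-k * n / m)) ** k


def required_bits(n, p):
    """Bits needed for n elements at target rate p, assuming the optimal k is used."""
    return math.ceil(-n * math.log(p) / math.log(2) ** 2)


# Spam example: 1 billion addresses in an 8-billion-bit array.
p_one_hash = false_positive_probability(8e9, 1, 1e9)
p_two_hashes = false_positive_probability(8e9, 2, 1e9)
print(f"k=1: P ~ {p_one_hash:.4f}")
print(f"k=2: P ~ {p_two_hashes:.4f}")

# A larger filter gives fewer false positives (k and n held fixed).
n, k = 1_000_000, 3
by_size = [false_positive_probability(bits_per_item * n, k, n)
           for bits_per_item in (4, 8, 16)]
for bits_per_item, p in zip((4, 8, 16), by_size):
    print(f"{bits_per_item} bits per element: P ~ {p:.4f}")

# Sizing the filter from a tolerated false positive rate.
m_needed = required_bits(1_000_000, 0.01)
print(f"bits for 1,000,000 elements at 1% false positives: {m_needed}")
```

For a 1% target this works out to a little under 10 bits per element, regardless of how large each stored element is.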
Number of hash functions:
Up to an optimal value, more hash functions lower the false positive probability; beyond it, the bit array
fills up more quickly and false positives increase again.
More hash functions also make the filter slower.
The relation between the number of hash functions k, the size of the filter m and the number of expected
entries n is

$$k = \frac{m}{n} \ln 2$$

which gives the optimal number of hash functions required for the filter to work efficiently
at the desired false positive rate.
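The trade-off in the number of hash functions can be checked by sweeping k for a fixed filter size. The sketch below assumes 8 bits per element and the standard approximations P ≈ (1 - e^(-kn/m))^k and k = (m/n) ln 2; the function names are hypothetical.

```python
import math


def false_positive_probability(m, k, n):
    """Approximate false positive rate for m bits, k hash functions, n elements."""
    return (1 - math.exp(-k * n / m)) ** k


def optimal_k(m, n):
    """Real-valued optimum; in practice it is rounded to a nearby integer."""
    return (m / n) * math.log(2)


m, n = 8_000, 1_000  # 8 bits per element (assumed values)
k_star = optimal_k(m, n)
print(f"optimal k from the formula: {k_star:.2f}")

# Sweep integer values of k: P falls, reaches a minimum near k_star, then rises.
sweep = {k: false_positive_probability(m, k, n) for k in range(1, 13)}
for k, p in sweep.items():
    print(f"k={k:2d}  P ~ {p:.4f}")

best_k = min(sweep, key=sweep.get)
print(f"best integer k in the sweep: {best_k}")
```

The sweep shows that too few hash functions leave the filter under-used, while too many fill the bit array so quickly that false positives rise again.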
https://www.geeksforgeeks.org/bloom-filters-introduction-and-python-implementation/

https://www.enjoyalgorithms.com/blog/bloom-filter

END of Section 4.3
