
Data Science for Big Data

Sampling from a stream


Hash functions

We call a hash family H = { h : U → {0, 1, …, m − 1} } 2-universal if, for every pair of distinct x, y ∈ U:
Pr_h[ h(x) = h(y) ] ≤ 1/m

One simple way would be to choose h(x) independently and uniformly for every x ∈ U, but that
is expensive !!!

For our course, we will assume all hash values h(x) are chosen independently! (A sketch of a cheap 2-universal family follows below.)
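As an illustration (not from the slides), here is a minimal sketch of the classic Carter–Wegman 2-universal family h(x) = ((a·x + b) mod p) mod m. The prime p and the example keys are assumptions for the sketch:

```python
import random

# Carter-Wegman 2-universal family: h(x) = ((a*x + b) mod p) mod m,
# where p is a prime at least as large as the universe and a != 0.
P = 2_147_483_647  # a Mersenne prime, assumed larger than the universe here

def make_2universal_hash(m, p=P):
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % m

h = make_2universal_hash(m=1000)
# Any two fixed distinct keys collide with probability about 1/m over the choice of h.
print(h(42), h(43))
```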
Querying

Is x present?

Naïve algorithm: linear in dataset size

17
Querying
ISBN present in collection?

IP seen by switch?

10.0.21.102

18
Solutions
• Universe U, but need to store only a set S of n items, n ≪ |U|
• Hash table of size O(n):
  • Space: O(n log |U|) bits
  • Query time: O(1) expected
• Bit array of size |U| (a small sketch follows below):
  • Space = |U| bits
  • Query time: O(1)

21
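For contrast with the Bloom filter coming next, a minimal bit-array membership sketch: exact answers, but one bit per universe element. The universe size and example keys are illustrative assumptions:

```python
# Exact membership with a bit array: one bit per element of the universe.
# Space is |U| bits no matter how few items we actually store.
UNIVERSE_SIZE = 1_000_000  # assumed universe size for this sketch

bits = bytearray(UNIVERSE_SIZE // 8 + 1)

def insert(x):
    bits[x >> 3] |= 1 << (x & 7)

def query(x):
    return (bits[x >> 3] >> (x & 7)) & 1 == 1

insert(123456)
print(query(123456), query(654321))  # True False
```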
Querying, Monte Carlo style
• In the hash table construction, we used random hash functions
  • we never return an incorrect answer
  • query time is a random variable
  • These are Las Vegas algorithms

• In Monte Carlo randomized algorithms, we are allowed to
  return incorrect answers with (small) probability, say δ

22
Bloom filter
[Bloom, 1970]

• A bit-array B of m bits, initially all 0
• k hash functions h1, …, hk, each mapping U → {0, 1, …, m − 1}

[Figure: element 𝒙 is hashed by h1, h2, h3 to three positions of the bit-array B]

23
Operations

• insert(B, x): for i = 1, …, k set B[hi(x)] = 1

• query(B, x): for i = 1, …, k check B[hi(x)]

• If all B[hi(x)] = 1, return PRESENT, else ABSENT (see the sketch below)

26
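A minimal Bloom filter sketch following the insert/query pseudocode above. Deriving the k positions from two base hashes ("double hashing") is a common implementation shortcut assumed here, not something the slides prescribe:

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per position, for simplicity

    def _positions(self, x):
        # Derive k hash positions from two base hashes (double hashing).
        d = hashlib.sha256(str(x).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, x):
        for p in self._positions(x):      # set B[h_i(x)] = 1 for all i
            self.bits[p] = 1

    def query(self, x):
        # PRESENT only if every probed bit is set; may be a false positive.
        return all(self.bits[p] for p in self._positions(x))

bf = BloomFilter(m=10_000, k=7)
bf.insert("10.0.21.102")
print(bf.query("10.0.21.102"), bf.query("192.168.0.1"))  # True, (almost surely) False
```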
Bloom Filter
• If the element x has been added to the Bloom filter, then query(B, x) always returns PRESENT

• What if x has not been added to the filter before?

• query(B, x) may sometimes still return PRESENT: a false positive

[Figure: a query for 𝒚 probes only bits already set by an earlier insert of 𝒙, producing a false positive]

28
Designing Bloom Filter
• Want to minimize the probability that we return a false positive
• Parameters: bit-array size m and number of hash functions k
• k = 1: a normal (hashed) bit-array

• What is the effect of changing k?

29
Effect of number of hash functions

• Increasing k
  • Possibly makes it harder for false positives to happen, because a query must find all k probed bits set

  • But also increases the number of filled-up positions in the array

• We can analyse the false positive probability to find out an "optimal" k

30
False positive analysis
• n elements inserted
• If x has not been inserted, what is the probability that query(B, x) returns PRESENT?
• Assume h1, …, hk are independent and uniform over all positions

32
False positive analysis
• The fraction of zero bits is (1 − 1/m)^(kn) ≈ e^(−kn/m), both in expectation and w.h.p.

• Pr[query(B, x) returns PRESENT] ≈ (1 − e^(−kn/m))^k

• Can we choose k to minimize this probability?

33
Choosing number of hash functions

• log(False Positive probability) = k · ln(1 − e^(−kn/m))

Minimized at k = (m/n) ln 2, i.e. when e^(−kn/m) = 1/2 (about half the bits are set)

34
Bloom filter design
• This "optimal" choice gives false positive rate = (1/2)^k ≈ (0.6185)^(m/n)

• If we want a false positive rate of δ, set k = log₂(1/δ) and m ≈ 1.44 · n · log₂(1/δ)

Example: for a 1% FPR, the formula gives k = 7 hash functions and about 9.6·n total bits (a numerical sketch follows below)

36
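A small numerical check of the analysis above: computing the approximate false positive rate (1 − e^(−kn/m))^k for several k, and comparing with k = (m/n) ln 2. The values of n and m are arbitrary assumptions for the sketch:

```python
import math

n = 1_000_000          # items inserted (example value)
m = 10 * n             # bits in the filter (example value)

def fpr(k):
    # Approximate false positive rate with k hash functions.
    return (1 - math.exp(-k * n / m)) ** k

for k in range(1, 15):
    print(k, round(fpr(k), 5))

k_opt = (m / n) * math.log(2)                 # ~6.93, so use k = 7 in practice
print("optimal k ~", k_opt, "FPR ~", fpr(round(k_opt)))   # roughly (1/2)^k_opt, under 1%
```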
Applications
• Widespread applications whenever small false positives are tolerable
• Used by browsers to decide whether a URL is potentially malicious: a BF is used in the
browser, and positives are actually checked with the server; also used to reject
common passwords
• Databases, e.g. BigTable, HBase, Cassandra, PostgreSQL, use BFs to avoid
disk lookups for non-existent rows/columns
• Bitcoin for wallet synchronization, "simplified payment verification"
(answering questions such as "is user A interested in transaction B?")

37
Handling deletions
• The chief drawback is that a BF does not allow deletions
• Counting Bloom Filter [Fan et al 00] (sketch below)

• Every entry in the BF is a small counter rather than a single bit

• insert(x) increments all counters B[hi(x)] by 1
• delete(x) decrements all B[hi(x)] by 1, if x is present
• maintains 4 bits per counter
• False negatives can happen (from counter overflow), but only with low probability

38
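A minimal counting Bloom filter sketch along the lines of [Fan et al 00]: counters instead of bits so deletion is possible. The 4-bit counter cap and the double-hashing scheme are illustrative choices, not details from the slides:

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m, k, cap=15):
        self.m, self.k, self.cap = m, k, cap
        self.counters = [0] * m           # small counters instead of single bits

    def _positions(self, x):
        d = hashlib.sha256(str(x).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, x):
        for p in self._positions(x):
            if self.counters[p] < self.cap:   # cap at 4 bits (15); overflow is what
                self.counters[p] += 1         # can later cause rare false negatives

    def delete(self, x):
        # Caller must ensure x was actually inserted, otherwise counts go wrong.
        for p in self._positions(x):
            if self.counters[p] > 0:
                self.counters[p] -= 1

    def query(self, x):
        return all(self.counters[p] > 0 for p in self._positions(x))

cbf = CountingBloomFilter(m=10_000, k=7)
cbf.insert("alice"); cbf.insert("bob")
cbf.delete("alice")
print(cbf.query("alice"), cbf.query("bob"))  # (almost surely) False, True
```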
Other Extensions

• There has been much recent work on Bloom filters

• Can we get by with less hashing?

• Can BFs be compressed (needed for distributed systems)?

• Are there better structures that use less space, less randomness and
fewer memory lookups?

39
Streaming problem: distinct count
• Universe is U, number of distinct elements is n; stream size is m, potentially much
bigger than n
• Example: all IP addresses
10.1.21.10, 10.93.28.1, ….., 98.0.3.1, ….. 10.93.28.1 …..

• IPs can repeat


• Want to estimate the number of distinct elements in the stream

40
Other applications

• Universe = set of all k-grams, stream is generated by a document corpus
  • need the number of distinct k-grams seen in the corpus

• Universe = telephone call records, stream generated by tuples (caller, callee)
  • need the number of phones that made > 0 calls

41
Solutions
• Naïve solution: O(n log |U|) space
  • store all the elements, sort and count distinct
  • store a hash map, insert only if not present in the map

• Bit array: |U| bits of space
  • bit i is set to 1 only if element i is seen in the stream

• Can we do this in less space? Not when an exact solution is needed!! (a tiny sketch of both baselines follows below)

43
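The two exact baselines above as a tiny sketch. Function names and the small example stream are illustrative; both approaches use space proportional to n or to the universe size:

```python
# Exact distinct count with a hash set: stores up to n elements.
def distinct_exact(stream):
    seen = set()
    for x in stream:
        seen.add(x)               # "insert only if not present" handled by the set
    return len(seen)

# Exact distinct count with a bit array over a universe {0, ..., u-1}: u bits.
def distinct_bitarray(stream, u):
    bits = bytearray(u // 8 + 1)
    for x in stream:
        bits[x >> 3] |= 1 << (x & 7)
    return sum(bin(b).count("1") for b in bits)

stream = [1, 1, 2, 7, 2, 9, 7]
print(distinct_exact(stream), distinct_bitarray(stream, u=16))  # 4 4
```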
Approximations

• (ε, δ) approximations
• Algorithm will use random hash functions
• Will return an answer n̂ such that
  (1 − ε) n ≤ n̂ ≤ (1 + ε) n
• This will happen with probability at least 1 − δ over the randomness of the
algorithm

44
First effort
• Stream length: m, distinct elements: n
• Proposed algo: Given space S, sample S items from the stream
• Find the number of distinct elements in this sample: d
• return an estimate computed from d (e.g. d scaled up by m/S)
• Not a constant factor approximation
• Example stream: 1, 1, 1, 1, ….., 1 (m − n + 1 ones), 2, 3, 4, …., n − 1

48
Linear Counting

• Bit array B of size m, initialized to all zeros

• Hash function h : U → {0, 1, …, m − 1}
• When seeing item x, set B[h(x)] = 1

• Z = number of zero entries in B
• Return estimate n̂ = m · ln(m / Z) (sketch below)

50
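A minimal linear-counting sketch of the estimator above, n̂ = m · ln(m / Z) with Z the number of zero entries. Python's salted built-in hash stands in for a random hash function, and the sizes are illustrative assumptions:

```python
import math, random

def linear_count(stream, m):
    bits = [0] * m
    seed = random.getrandbits(64)
    for x in stream:
        bits[hash((seed, x)) % m] = 1     # set B[h(x)] = 1
    zeros = bits.count(0)                 # Z = number of zero entries
    if zeros == 0:
        return float("inf")               # array saturated: m was too small
    return m * math.log(m / zeros)        # estimate n ~ m * ln(m / Z)

stream = [random.randrange(100_000) for _ in range(50_000)]
print(len(set(stream)), round(linear_count(stream, m=100_000)))
```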
Linear Counting Analysis
• Pr[ position remaining 0 ] = (1 − 1/m)^n ≈ e^(−n/m)
• Expected number of positions at zero = m · e^(−n/m)

• Using tail inequalities we can show this is concentrated around its expectation

• Good theoretical bounds only when m grows linearly with n (hence "linear" counting); often useful in practice

51
Let’s try something else
• Suppose you have n distinct numbers, each chosen uniformly at random
from {1, 2, …, N}

• How many do you expect are divisible by 2?
• by 4?
• by 8?
• …
• by 2^k?
• by 2^(log₂ n)?
• What are the characteristics of each of these classes?

• So, if we find out the largest k such that there exists at least one
number that is divisible by 2^k, then….

• Does it help in getting an estimate of n?


Flajolet Martin Sketch
• Components
• "random" hash function h : U → {0, 1}^L for some large L
  • h(x) is a length-L bit string
  • assume it is completely random; can relax this assumption

• r(x) = position of the rightmost 1 in the bit representation of h(x)

• The sketch maintains R, the largest r(x) seen in the stream

Nice historical overview + some math 


https://deepai.org/publication/how-flajolet-processed-streams-with-coin-flips
56
Flajolet Martin Sketch
Initialize:
• Choose a "random" hash function h : U → {0, 1}^L
• R ← 0

Process(x):
• if r(x) > R: R ← r(x)

Estimate:
• return 2^R (a runnable sketch follows below)

58
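A minimal Flajolet–Martin sketch following the Initialize/Process/Estimate outline above. Python's salted built-in hash stands in for the "random" hash function, and L = 64 is an illustrative choice:

```python
import random

L = 64                                     # length of the hash bit-string (assumed)

def make_fm_sketch():
    seed = random.getrandbits(64)
    state = {"R": 0}

    def r(v):
        # Position of the rightmost 1 bit (0-indexed); treat v = 0 as all-zeros.
        return (v & -v).bit_length() - 1 if v else L

    def process(x):
        hv = hash((seed, x)) & ((1 << L) - 1)
        state["R"] = max(state["R"], r(hv))

    def estimate():
        return 2 ** state["R"]

    return process, estimate

process, estimate = make_fm_sketch()
for x in range(100_000):
    process(x)
print(estimate())   # within a constant factor of 100000, with constant probability
```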
Example h(.)
0110101
1011010
1000100
1111010

59
Space usage
• We need 2^L to be large enough, say L = O(log u)
  • by a birthday-paradox analysis, no collisions among the hash values with high prob

• Sketch: R, which needs only O(log L) = O(log log u) bits !!!

• Total space usage = O(log u) bits to store a (2-wise independent) hash function + O(log log u) bits for R

60
Intuition
• Assume hash values are uniformly distributed in [0, max_hash_val]
• The probability that a uniform bit-string
  • is divisible by 2 is ½
  • is divisible by 4 is ¼
  • ….
  • is divisible by 2^r is 1/2^r
• We don't expect any of the n hash values to be divisible by 2^r once 2^r ≫ n

61
Formalizing intuition
• S = set of elements that appeared in the stream, |S| = n
• For any j ∈ S and any r: X_rj = indicator of [h(j) divisible by 2^r], i.e. of [r(j) ≥ r]
  • Y_r = number of j ∈ S such that X_rj = 1

• Let R be the final value of the sketch after the algo has seen all the data

62
Proof of FM

• R ≥ r iff Y_r > 0; equivalently, R < r iff Y_r = 0

• X_rj = 1 with prob 1/2^r, and 0 else

• E[Y_r] = Σ_{j∈S} E[X_rj] = n/2^r

• var(Y_r) ≤ Σ_{j∈S} E[X_rj²] ≤ n/2^r

• Pr[Y_r > 0] = Pr[Y_r ≥ 1] ≤ E[Y_r] = n/2^r        (Markov)

• Pr[Y_r = 0] ≤ Pr[ |Y_r − E[Y_r]| ≥ E[Y_r] ] ≤ var(Y_r) / E[Y_r]² ≤ 2^r/n        (Chebyshev)

67
Upper bound
Returned estimate: n̂ = 2^R

• Let r = smallest integer with 2^r ≥ 4n
• Pr[ n̂ ≥ 4n ] = Pr[ R ≥ r ] = Pr[ Y_r > 0 ] ≤ n/2^r ≤ 1/4

68
Lower bound
Returned estimate: n̂ = 2^R

• Let r = largest integer with 2^r ≤ n/4
• Pr[ n̂ ≤ n/4 ] = Pr[ R ≤ r ] = Pr[ Y_{r+1} = 0 ] ≤ 2^{r+1}/n ≤ 1/2

69
Understanding the bound
• By the union bound, with constant probability (at least 1/4 with the constants above),

  n/4 ≤ n̂ ≤ 4n

• Can get somewhat better constants
• Need only 2-wise independent hash functions, since we only used
variances

70
Improving the probabilities
• To improve the probabilities, a common trick: median of estimates
• Create t independent copies R₁, …, R_t in parallel
• return the median of the t estimates
• Expect at most t/4 of them to exceed 4n
• But if the median exceeds 4n, then at least t/2 of them do  using a Chernoff bound this prob is e^(−Ω(t))
72
Improving the probabilities
• To improve the probabilities, a common trick: median of estimates
• Create t independent copies R₁, …, R_t in parallel
• return the median of the t estimates

• Using a Chernoff bound, can show that the median will lie in [n/4, 4n] with probability ≥ 1 − e^(−Ω(t)).
• Given error prob δ, choose t = O(log(1/δ))
• To get an estimate within a (1 ± ε) factor, with probability 1 − δ:
  • First calculate the mean of O(1/ε²) estimates
  • Then calculate the median of O(log(1/δ)) such means (sketch below)

73
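A sketch of the boosting recipe above: average about 1/ε² independent estimates, then take the median of O(log 1/δ) such averages. Applying it verbatim to the basic FM estimator is only for illustration (the (1 ± ε) guarantee needs an estimator with suitable variance bounds); the constants and the reuse of the earlier FM logic are assumptions:

```python
import math, random, statistics

def fm_estimate(stream, seed):
    # One basic Flajolet-Martin estimate: 2^(max position of rightmost 1 bit).
    R = 0
    for x in stream:
        hv = hash((seed, x)) & ((1 << 64) - 1)
        if hv:
            R = max(R, (hv & -hv).bit_length() - 1)
    return 2 ** R

def boosted_estimate(stream, eps=0.25, delta=0.05):
    inner = int(1 / eps ** 2)                   # copies averaged per group
    outer = int(4 * math.log(1 / delta)) + 1    # groups whose means we take the median of
    means = []
    for _ in range(outer):
        means.append(sum(fm_estimate(stream, random.getrandbits(64))
                         for _ in range(inner)) / inner)
    return statistics.median(means)

stream = [random.randrange(10**6) for _ in range(20_000)]
print(len(set(stream)), round(boosted_estimate(stream)))
```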
Summary

• Streaming model: a useful abstraction

• Estimating even basic statistics is nontrivial

• Estimating number of distinct elements


• Linear counting
• Flajolet Martin

74
References:

• Primary reference for this lecture


• Survey on Bloom Filters, Broder and Mitzenmacher 2005,
  https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
• http://www.firatatagun.com/blog/2016/09/25/bloom-filters-explanation-use-cases-and-examples/

• Others
• Randomized Algorithms by Mitzenmacher and Upfal.

75
k-wise universal
• For any k distinct x₁, …, x_k and any (not necessarily distinct) y₁, …, y_k,

  Pr[ h(x₁) = y₁ ∧ … ∧ h(x_k) = y_k ] = m^(−k)

• Needs only O(k log u) bits of storage for the standard polynomial construction (sketch below)

76
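A minimal sketch of the standard construction achieving k-wise independence: evaluate a random degree-(k − 1) polynomial over a prime field, then reduce into [m]. The prime p and the final mod-m step (which makes the output only approximately uniform over [m]) are illustrative choices:

```python
import random

P = 2_147_483_647       # prime assumed to be at least the universe size

def make_kwise_hash(k, m, p=P):
    coeffs = [random.randrange(p) for _ in range(k)]   # k random coefficients: O(k log p) bits
    def h(x):
        v = 0
        for c in reversed(coeffs):                     # Horner's rule: evaluate the polynomial mod p
            v = (v * x + c) % p
        return v % m                                   # values over [p] are exactly k-wise independent
    return h

h = make_kwise_hash(k=4, m=1024)
print(h(10), h(11), h(12))
```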
