
Data Science for Big Data

Sampling from a stream


Hash functions

We call a hash family H = { h : U → {0, 1, …, m − 1} } 2-universal if, for every pair of distinct x, y ∈ U:
Pr_h[ h(x) = h(y) ] ≤ 1/m

One simple way would be to choose h(x) independently and uniformly for every x ∈ U, but that
is expensive !!!

For our course, we will assume all hash values h(x) are chosen independently! (A sketch of a cheap 2-universal family follows below.)
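As an illustration (not from the slides), here is a minimal sketch of the classic Carter–Wegman 2-universal family h(x) = ((a·x + b) mod p) mod m. The prime p and the example keys are assumptions for the sketch:

```python
import random

# Carter-Wegman 2-universal family: h(x) = ((a*x + b) mod p) mod m,
# where p is a prime at least as large as the universe and a != 0.
P = 2_147_483_647  # a Mersenne prime, assumed larger than the universe here

def make_2universal_hash(m, p=P):
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda x: ((a * x + b) % p) % m

h = make_2universal_hash(m=1000)
# Any two fixed distinct keys collide with probability about 1/m over the choice of h.
print(h(42), h(43))
```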
Querying

Is x present?

Naïve algorithm: linear in dataset size

17
Querying
ISBN present in collection?

IP seen by switch?

10.0.21.102

18
Solutions
• Universe U, but need to store only a set S of n items, n ≪ |U|
• Hash table of size O(n):
  • Space: O(n log |U|) bits
  • Query time: O(1) expected
• Bit array of size |U| (a small sketch follows below):
  • Space = |U| bits
  • Query time: O(1)

21
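For contrast with the Bloom filter coming next, a minimal bit-array membership sketch: exact answers, but one bit per universe element. The universe size and example keys are illustrative assumptions:

```python
# Exact membership with a bit array: one bit per element of the universe.
# Space is |U| bits no matter how few items we actually store.
UNIVERSE_SIZE = 1_000_000  # assumed universe size for this sketch

bits = bytearray(UNIVERSE_SIZE // 8 + 1)

def insert(x):
    bits[x >> 3] |= 1 << (x & 7)

def query(x):
    return (bits[x >> 3] >> (x & 7)) & 1 == 1

insert(123456)
print(query(123456), query(654321))  # True False
```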
Querying, Monte Carlo style
• In the hash table construction, we used random hash functions
  • we never return an incorrect answer
  • query time is a random variable
  • These are Las Vegas algorithms

• In Monte Carlo randomized algorithms, we are allowed to
  return incorrect answers with (small) probability, say δ

22
Bloom filter
[Bloom, 1970]

• A bit-array B of m bits, initially all 0
• k hash functions h1, …, hk, each mapping U → {0, 1, …, m − 1}

[Figure: element 𝒙 is hashed by h1, h2, h3 to three positions of the bit-array B]

23
Operations

• insert(B, x): for i = 1, …, k set B[hi(x)] = 1

• query(B, x): for i = 1, …, k check B[hi(x)]

• If all B[hi(x)] = 1, return PRESENT, else ABSENT (see the sketch below)

26
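A minimal Bloom filter sketch following the insert/query pseudocode above. Deriving the k positions from two base hashes ("double hashing") is a common implementation shortcut assumed here, not something the slides prescribe:

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)          # one byte per position, for simplicity

    def _positions(self, x):
        # Derive k hash positions from two base hashes (double hashing).
        d = hashlib.sha256(str(x).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, x):
        for p in self._positions(x):      # set B[h_i(x)] = 1 for all i
            self.bits[p] = 1

    def query(self, x):
        # PRESENT only if every probed bit is set; may be a false positive.
        return all(self.bits[p] for p in self._positions(x))

bf = BloomFilter(m=10_000, k=7)
bf.insert("10.0.21.102")
print(bf.query("10.0.21.102"), bf.query("192.168.0.1"))  # True, (almost surely) False
```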
Bloom Filter
• If the element x has been added to the Bloom filter, then query(B, x) always returns PRESENT

• What if x has not been added to the filter before?

• query(B, x) may sometimes still return PRESENT: a false positive

[Figure: a query for 𝒚 probes only bits already set by an earlier insert of 𝒙, producing a false positive]

28
Designing Bloom Filter
• Want to minimize the probability that we return a false positive
• Parameters: bit-array size m and number of hash functions k
• k = 1: a normal (hashed) bit-array

• What is the effect of changing k?

29
Effect of number of hash functions

• Increasing k
  • Possibly makes it harder for false positives to happen, because a query must find all k probed bits set

  • But also increases the number of filled-up positions in the array

• We can analyse the false positive probability to find out an "optimal" k

30
False positive analysis
• n elements inserted
• If x has not been inserted, what is the probability that query(B, x) returns PRESENT?
• Assume h1, …, hk are independent and uniform over all positions

32
False positive analysis
• The fraction of zero bits is (1 − 1/m)^(kn) ≈ e^(−kn/m), both in expectation and w.h.p.

• Pr[query(B, x) returns PRESENT] ≈ (1 − e^(−kn/m))^k

• Can we choose k to minimize this probability?

33
Choosing number of hash functions

• log(False Positive probability) = k · ln(1 − e^(−kn/m))

Minimized at k = (m/n) ln 2, i.e. when e^(−kn/m) = 1/2 (about half the bits are set)

34
Bloom filter design
• This "optimal" choice gives false positive rate = (1/2)^k ≈ (0.6185)^(m/n)

• If we want a false positive rate of δ, set k = log₂(1/δ) and m ≈ 1.44 · n · log₂(1/δ)

Example: for a 1% FPR, the formula gives k = 7 hash functions and about 9.6·n total bits (a numerical sketch follows below)

36
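A small numerical check of the analysis above: computing the approximate false positive rate (1 − e^(−kn/m))^k for several k, and comparing with k = (m/n) ln 2. The values of n and m are arbitrary assumptions for the sketch:

```python
import math

n = 1_000_000          # items inserted (example value)
m = 10 * n             # bits in the filter (example value)

def fpr(k):
    # Approximate false positive rate with k hash functions.
    return (1 - math.exp(-k * n / m)) ** k

for k in range(1, 15):
    print(k, round(fpr(k), 5))

k_opt = (m / n) * math.log(2)                 # ~6.93, so use k = 7 in practice
print("optimal k ~", k_opt, "FPR ~", fpr(round(k_opt)))   # roughly (1/2)^k_opt, under 1%
```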
Applications
• Widespread applications whenever small false positives are tolerable
• Used by browsers to decide whether a URL is potentially malicious: a BF is used in the
browser, and positives are actually checked with the server; also used to reject
common passwords
• Databases, e.g. BigTable, HBase, Cassandra, PostgreSQL, use BFs to avoid
disk lookups for non-existent rows/columns
• Bitcoin for wallet synchronization, "simplified payment verification"
(answering questions such as "is user A interested in transaction B?")

37
Handling deletions
• The chief drawback is that a BF does not allow deletions
• Counting Bloom Filter [Fan et al 00] (sketch below)

• Every entry in the BF is a small counter rather than a single bit

• insert(x) increments all counters B[hi(x)] by 1
• delete(x) decrements all B[hi(x)] by 1, if x is present
• maintains 4 bits per counter
• False negatives can happen (from counter overflow), but only with low probability

38
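A minimal counting Bloom filter sketch along the lines of [Fan et al 00]: counters instead of bits so deletion is possible. The 4-bit counter cap and the double-hashing scheme are illustrative choices, not details from the slides:

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, m, k, cap=15):
        self.m, self.k, self.cap = m, k, cap
        self.counters = [0] * m           # small counters instead of single bits

    def _positions(self, x):
        d = hashlib.sha256(str(x).encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, x):
        for p in self._positions(x):
            if self.counters[p] < self.cap:   # cap at 4 bits (15); overflow is what
                self.counters[p] += 1         # can later cause rare false negatives

    def delete(self, x):
        # Caller must ensure x was actually inserted, otherwise counts go wrong.
        for p in self._positions(x):
            if self.counters[p] > 0:
                self.counters[p] -= 1

    def query(self, x):
        return all(self.counters[p] > 0 for p in self._positions(x))

cbf = CountingBloomFilter(m=10_000, k=7)
cbf.insert("alice"); cbf.insert("bob")
cbf.delete("alice")
print(cbf.query("alice"), cbf.query("bob"))  # (almost surely) False, True
```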
Other Extensions

• There has been much recent work on Bloom filters

• Can we get by with less hashing?

• Can BFs be compressed (needed for distributed systems)?

• Are there better structures that use less space, less randomness and
fewer memory lookups?

39
Streaming problem: distinct count
• Universe is U, number of distinct elements is n; stream size is m, potentially much
bigger than n
• Example: all IP addresses
10.1.21.10, 10.93.28.1, ….., 98.0.3.1, ….. 10.93.28.1 …..

• IPs can repeat


• Want to estimate the number of distinct elements in the stream

40
Other applications

• Universe = set of all k-grams, stream is generated by a document corpus
  • need the number of distinct k-grams seen in the corpus

• Universe = telephone call records, stream generated by tuples (caller, callee)
  • need the number of phones that made > 0 calls

41
Solutions
• Naïve solution: O(n log |U|) space
  • store all the elements, sort and count distinct
  • store a hash map, insert only if not present in the map

• Bit array: |U| bits of space
  • bit i is set to 1 only if element i is seen in the stream

• Can we do this in less space? Not when an exact solution is needed!! (a tiny sketch of both baselines follows below)

43
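The two exact baselines above as a tiny sketch. Function names and the small example stream are illustrative; both approaches use space proportional to n or to the universe size:

```python
# Exact distinct count with a hash set: stores up to n elements.
def distinct_exact(stream):
    seen = set()
    for x in stream:
        seen.add(x)               # "insert only if not present" handled by the set
    return len(seen)

# Exact distinct count with a bit array over a universe {0, ..., u-1}: u bits.
def distinct_bitarray(stream, u):
    bits = bytearray(u // 8 + 1)
    for x in stream:
        bits[x >> 3] |= 1 << (x & 7)
    return sum(bin(b).count("1") for b in bits)

stream = [1, 1, 2, 7, 2, 9, 7]
print(distinct_exact(stream), distinct_bitarray(stream, u=16))  # 4 4
```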
Approximations

• (ε, δ) approximations
• Algorithm will use random hash functions
• Will return an answer n̂ such that
  (1 − ε) n ≤ n̂ ≤ (1 + ε) n
• This will happen with probability at least 1 − δ over the randomness of the
algorithm

44
First effort
• Stream length: m, distinct elements: n
• Proposed algo: Given space S, sample S items from the stream
• Find the number of distinct elements in this sample: d
• return an estimate computed from d (e.g. d scaled up by m/S)
• Not a constant factor approximation
• Example stream: 1, 1, 1, 1, ….., 1 (m − n + 1 ones), 2, 3, 4, …., n − 1

48
Linear Counting

• Bit array B of size m, initialized to all zeros

• Hash function h : U → {0, 1, …, m − 1}
• When seeing item x, set B[h(x)] = 1

• Z = number of zero entries in B
• Return estimate n̂ = m · ln(m / Z) (sketch below)

50
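A minimal linear-counting sketch of the estimator above, n̂ = m · ln(m / Z) with Z the number of zero entries. Python's salted built-in hash stands in for a random hash function, and the sizes are illustrative assumptions:

```python
import math, random

def linear_count(stream, m):
    bits = [0] * m
    seed = random.getrandbits(64)
    for x in stream:
        bits[hash((seed, x)) % m] = 1     # set B[h(x)] = 1
    zeros = bits.count(0)                 # Z = number of zero entries
    if zeros == 0:
        return float("inf")               # array saturated: m was too small
    return m * math.log(m / zeros)        # estimate n ~ m * ln(m / Z)

stream = [random.randrange(100_000) for _ in range(50_000)]
print(len(set(stream)), round(linear_count(stream, m=100_000)))
```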
Linear Counting Analysis
• Pr[ position remaining 0 ] = (1 − 1/m)^n ≈ e^(−n/m)
• Expected number of positions at zero = m · e^(−n/m)

• Using tail inequalities we can show this is concentrated around its expectation

• Good theoretical bounds only when m grows linearly with n (hence "linear" counting); often useful in practice

51
Let’s try something else
• Suppose you have n distinct numbers, each chosen uniformly at random
from {1, 2, …, N}

• How many do you expect are divisible by 2?
• by 4?
• by 8?
• …
• by 2^k?
• by 2^(log₂ n)?
• What are the characteristics of each of these classes?

• So, if we find out the largest k such that there exists at least one
number that is divisible by 2^k, then….

• Does it help in getting an estimate of n?


Flajolet Martin Sketch
• Components
• "random" hash function h : U → {0, 1}^L for some large L
  • h(x) is a length-L bit string
  • assume it is completely random; can relax this assumption

• r(x) = position of the rightmost 1 in the bit representation of h(x)

• The sketch maintains R, the largest r(x) seen in the stream

Nice historical overview + some math 


https://deepai.org/publication/how-flajolet-processed-streams-with-coin-flips
56
Flajolet Martin Sketch
Initialize:
• Choose a "random" hash function h : U → {0, 1}^L
• R ← 0

Process(x):
• if r(x) > R: R ← r(x)

Estimate:
• return 2^R (a runnable sketch follows below)

58
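A minimal Flajolet–Martin sketch following the Initialize/Process/Estimate outline above. Python's salted built-in hash stands in for the "random" hash function, and L = 64 is an illustrative choice:

```python
import random

L = 64                                     # length of the hash bit-string (assumed)

def make_fm_sketch():
    seed = random.getrandbits(64)
    state = {"R": 0}

    def r(v):
        # Position of the rightmost 1 bit (0-indexed); treat v = 0 as all-zeros.
        return (v & -v).bit_length() - 1 if v else L

    def process(x):
        hv = hash((seed, x)) & ((1 << L) - 1)
        state["R"] = max(state["R"], r(hv))

    def estimate():
        return 2 ** state["R"]

    return process, estimate

process, estimate = make_fm_sketch()
for x in range(100_000):
    process(x)
print(estimate())   # within a constant factor of 100000, with constant probability
```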
Example h(.)
0110101
1011010
1000100
1111010

59
Space usage
• We need 2^L to be large enough, say L = O(log u)
  • by a birthday-paradox analysis, no collisions among the hash values with high prob

• Sketch: R, which needs only O(log L) = O(log log u) bits !!!

• Total space usage = O(log u) bits to store a (2-wise independent) hash function + O(log log u) bits for R

60
Intuition
• Assume hash values are uniformly distributed in [0, max_hash_val]
• The probability that a uniform bit-string
  • is divisible by 2 is ½
  • is divisible by 4 is ¼
  • ….
  • is divisible by 2^r is 1/2^r
• We don't expect any of the n hash values to be divisible by 2^r once 2^r ≫ n

61
Formalizing intuition
• S = set of elements that appeared in the stream, |S| = n
• For any j ∈ S and any r: X_rj = indicator of [h(j) divisible by 2^r], i.e. of [r(j) ≥ r]
  • Y_r = number of j ∈ S such that X_rj = 1

• Let R be the final value of the sketch after the algo has seen all the data

62
Proof of FM

• R ≥ r iff Y_r > 0; equivalently, R < r iff Y_r = 0

• X_rj = 1 with prob 1/2^r, and 0 else

• E[Y_r] = Σ_{j∈S} E[X_rj] = n/2^r

• var(Y_r) ≤ Σ_{j∈S} E[X_rj²] ≤ n/2^r

• Pr[Y_r > 0] = Pr[Y_r ≥ 1] ≤ E[Y_r] = n/2^r        (Markov)

• Pr[Y_r = 0] ≤ Pr[ |Y_r − E[Y_r]| ≥ E[Y_r] ] ≤ var(Y_r) / E[Y_r]² ≤ 2^r/n        (Chebyshev)

67
Upper bound
Returned estimate: n̂ = 2^R

• Let r = smallest integer with 2^r ≥ 4n
• Pr[ n̂ ≥ 4n ] = Pr[ R ≥ r ] = Pr[ Y_r > 0 ] ≤ n/2^r ≤ 1/4

68
Lower bound
Returned estimate: n̂ = 2^R

• Let r = largest integer with 2^r ≤ n/4
• Pr[ n̂ ≤ n/4 ] = Pr[ R ≤ r ] = Pr[ Y_{r+1} = 0 ] ≤ 2^{r+1}/n ≤ 1/2

69
Understanding the bound
• By the union bound, with constant probability (at least 1/4 with the constants above),

  n/4 ≤ n̂ ≤ 4n

• Can get somewhat better constants
• Need only 2-wise independent hash functions, since we only used
variances

70
Improving the probabilities
• To improve the probabilities, a common trick: median of estimates
• Create t independent copies R₁, …, R_t in parallel
• return the median of the t estimates
• Expect at most t/4 of them to exceed 4n
• But if the median exceeds 4n, then at least t/2 of them do  using a Chernoff bound this prob is e^(−Ω(t))
72
Improving the probabilities
• To improve the probabilities, a common trick: median of estimates
• Create t independent copies R₁, …, R_t in parallel
• return the median of the t estimates

• Using a Chernoff bound, can show that the median will lie in [n/4, 4n] with probability ≥ 1 − e^(−Ω(t)).
• Given error prob δ, choose t = O(log(1/δ))
• To get an estimate within a (1 ± ε) factor, with probability 1 − δ:
  • First calculate the mean of O(1/ε²) estimates
  • Then calculate the median of O(log(1/δ)) such means (sketch below)

73
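A sketch of the boosting recipe above: average about 1/ε² independent estimates, then take the median of O(log 1/δ) such averages. Applying it verbatim to the basic FM estimator is only for illustration (the (1 ± ε) guarantee needs an estimator with suitable variance bounds); the constants and the reuse of the earlier FM logic are assumptions:

```python
import math, random, statistics

def fm_estimate(stream, seed):
    # One basic Flajolet-Martin estimate: 2^(max position of rightmost 1 bit).
    R = 0
    for x in stream:
        hv = hash((seed, x)) & ((1 << 64) - 1)
        if hv:
            R = max(R, (hv & -hv).bit_length() - 1)
    return 2 ** R

def boosted_estimate(stream, eps=0.25, delta=0.05):
    inner = int(1 / eps ** 2)                   # copies averaged per group
    outer = int(4 * math.log(1 / delta)) + 1    # groups whose means we take the median of
    means = []
    for _ in range(outer):
        means.append(sum(fm_estimate(stream, random.getrandbits(64))
                         for _ in range(inner)) / inner)
    return statistics.median(means)

stream = [random.randrange(10**6) for _ in range(20_000)]
print(len(set(stream)), round(boosted_estimate(stream)))
```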
Summary

• Streaming model: a useful abstraction

• Estimating even basic statistics is nontrivial

• Estimating number of distinct elements


• Linear counting
• Flajolet Martin

74
References:

• Primary reference for this lecture


• Survey on Bloom Filters, Broder and Mitzenmacher 2005,
  https://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
• http://www.firatatagun.com/blog/2016/09/25/bloom-filters-explanation-use-cases-and-examples/

• Others
• Randomized Algorithms by Mitzenmacher and Upfal.

75
k-wise universal
• For any k distinct x₁, …, x_k and any (not necessarily distinct) y₁, …, y_k,

  Pr[ h(x₁) = y₁ ∧ … ∧ h(x_k) = y_k ] = m^(−k)

• Needs only O(k log u) bits of storage for the standard polynomial construction (sketch below)

76
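A minimal sketch of the standard construction achieving k-wise independence: evaluate a random degree-(k − 1) polynomial over a prime field, then reduce into [m]. The prime p and the final mod-m step (which makes the output only approximately uniform over [m]) are illustrative choices:

```python
import random

P = 2_147_483_647       # prime assumed to be at least the universe size

def make_kwise_hash(k, m, p=P):
    coeffs = [random.randrange(p) for _ in range(k)]   # k random coefficients: O(k log p) bits
    def h(x):
        v = 0
        for c in reversed(coeffs):                     # Horner's rule: evaluate the polynomial mod p
            v = (v * x + c) % p
        return v % m                                   # values over [p] are exactly k-wise independent
    return h

h = make_kwise_hash(k=4, m=1024)
print(h(10), h(11), h(12))
```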
