DGIM
The stream model
• Data enters sequentially at a rapid rate from one or more inputs
• Processing must happen in real time
Counting bits with the DGIM algorithm
Sliding windows
• A useful model: queries are about a window of the N most recent elements
• Or, there are so many streams that windows for all of them cannot be stored
Problem description
• Problem
  • Given a stream of 0’s and 1’s
  • Answer queries of the form “how many 1’s are in the last k bits?” where k ≤ N
• Obvious solution
  • Store the most recent N bits (i.e., window size = N)
  • When a new bit arrives, discard the (N+1)st (oldest) bit
• Real problem
  • Slow: need to scan k bits to answer each count
  • What if we cannot afford to store N bits?
  • Estimate with an approximate answer
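The obvious solution above can be sketched in a few lines (an illustrative snippet; the class name is made up for this example):

```python
from collections import deque

# Exact sliding-window counter: O(N) space, O(k) query time.
class ExactWindowCounter:
    def __init__(self, N):
        self.N = N
        self.window = deque(maxlen=N)  # the (N+1)st-oldest bit is discarded automatically

    def add(self, bit):
        self.window.append(bit)

    def count_ones(self, k):
        # scan the most recent k bits -- this linear scan is exactly
        # what DGIM is designed to avoid
        assert k <= self.N
        return sum(list(self.window)[-k:])
```

DGIM replaces this exact count with an approximate one using exponentially less memory.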
Datar-Gionis-Indyk-Motwani Algorithm (DGIM)
Overview
• Gives an approximate answer: estimates the number of 1’s within 50% relative error, using only O(log² N) bits of memory
Main idea of the algorithm
Timestamps
• Each bit in the stream has a timestamp: its position in the stream from the beginning
• Store the most recent timestamp; relative to it, the position of any other bit in the window can be identified
Buckets
• Each bucket has two components:
  • The timestamp of its right (most recent) end
  • The size: the number of 1’s in the bucket, which is always a power of two, 2^j
RULES FOR FORMING THE BUCKETS:
1. The right end of a bucket always holds a 1 (leading 0’s belong to no bucket). E.g., 1001011 → a bucket of size 4, having four 1’s and ending with a 1 at its right end.
2. Every bucket contains at least one 1; no empty bucket can be formed.
3. Bucket sizes cannot decrease as we move to the left (sizes appear in non-decreasing order toward the left, i.e., back in time).
Representing the stream by buckets
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• Buckets do not overlap.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in time).
Updating buckets when a new bit arrives
• Drop the oldest bucket if its right end no longer overlaps the window
• If the new bit is 0, nothing more needs to be done
• If the new bit is 1, create a new bucket for it: size = 1, timestamp = current time modulo N
• If there are now 3 buckets of size 1, merge the two oldest into one of size 2
• If that creates 3 buckets of size 2, merge the two oldest into one of size 4
• ...
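These update rules can be sketched as follows (a minimal illustration; the bucket representation and names are assumptions, and timestamps are kept unreduced here rather than modulo N for simplicity):

```python
# DGIM update sketch: buckets are (timestamp_of_right_end, size) pairs,
# newest first; sizes are powers of 2, at most two buckets of each size.
class DGIM:
    def __init__(self, N):
        self.N = N          # window length
        self.t = 0          # current timestamp
        self.buckets = []   # list of (timestamp, size), newest first

    def add(self, bit):
        self.t += 1
        # drop the oldest bucket once its right end falls out of the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            # whenever three buckets share a size, merge the two oldest
            size = 1
            while True:
                same = [i for i, (_, s) in enumerate(self.buckets) if s == size]
                if len(same) < 3:
                    break
                i, j = same[-2], same[-1]     # the two oldest of that size
                ts = self.buckets[i][0]       # merged bucket keeps the newer right end
                self.buckets[j] = (ts, 2 * size)
                del self.buckets[i]
                size *= 2                     # a merge may cascade upward
```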
Example of updating process
Query Answering
How many 1’s are in the most recent k bits?
• Sum the sizes of all buckets whose right end lies within the last k bits
• Count only half the size of the oldest such bucket, since it may lie only partially inside the window
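The standard DGIM estimate (sum every bucket whose right end is inside the window, counting only half of the oldest such bucket) can be sketched as a standalone function; the bucket representation below is an assumption for illustration:

```python
# buckets: (right_end_timestamp, size) pairs, newest first
def dgim_count(buckets, t, k):
    """Estimate the number of 1s among the last k bits at current time t."""
    total, oldest = 0, 0
    for ts, size in buckets:
        if ts > t - k:        # bucket's right end lies inside the window
            total += size
            oldest = size     # remember the oldest contributing bucket
    # the oldest bucket may be only partially inside: count half of it
    return total - oldest // 2

# Example: buckets of sizes 1, 2, 4 with right ends at t = 100, 99, 95
print(dgim_count([(100, 1), (99, 2), (95, 4)], t=100, k=10))  # → 5
```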
Memory requirements
• Each bucket takes O(log N) bits, and there are only O(log N) buckets (at most two of each power-of-two size), so the whole window is represented in O(log² N) bits
Performance guarantee
• Suppose the oldest in-window bucket has size 2^r
• The error comes only from that bucket, of which we count half: at most 2^(r−1) ones
• Since the true count is at least 2^(r−1), the relative error is at most 50%
Bloom Filter
Presented by Naheed Anjum Arafat
Motivation:
The “Set Membership” Problem
• Input: x (an element), S (a finite set of elements)
• Output:
  • True (if x in S)
  • False (if x not in S)
Streaming constraints:
• Limited space per item
• Limited processing time per item
• Approximate answer based on a summary/sketch of the data stream in memory
Bloom Filter
• Consists of
  • a vector of n Boolean values, initially all set false (space: O(n))
  • k independent and uniform hash functions h_0, h_1, …, h_(k−1), each outputting a value within the range {0, 1, …, n−1}
Example (n = 10): positions 0–9 all hold F.
Bloom Filter
• For each element s in S, the Boolean values at positions h_0(s), h_1(s), …, h_(k−1)(s) are set true
• Complexity of insertion: O(k)
Example (k = 3): inserting s1 with h_0(s1) = 1, h_1(s1) = 4, h_2(s1) = 6 turns the array into
F T F F T F T F F F   (positions 0–9)
Bloom Filter
• Note: a particular Boolean value may be set to true several times
Example (k = 3): inserting s2 with h_0(s2) = 4, h_1(s2) = 7, h_2(s2) = 9 (bit 4 was already true) gives
F T F F T F T T F T   (positions 0–9)
Algorithm to Approximate Set Membership Query
Input: x (may or may not be an element of S)
Output: Boolean
Runtime complexity: O(k)

for all i in {0, 1, …, k−1}:
    if the bit at position h_i(x) is False:
        return False
return True

Example (k = 3, array F T F F T F T T F T): querying x = s1 returns True; querying an unseen x = s3 returns False unless all k of its probed bits happen to be set.
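A minimal Bloom filter sketch of the insertion and query algorithms above (illustrative only: the k positions are derived from a single SHA-256 digest rather than k truly independent hash functions):

```python
import hashlib

class BloomFilter:
    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [False] * n        # vector of n Booleans, initially false

    def _positions(self, x):
        # derive k pseudo-independent positions from one digest (assumption
        # made for brevity; the slides assume k independent hash functions)
        digest = hashlib.sha256(str(x).encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.n
                for i in range(self.k)]

    def add(self, x):                  # O(k) insertion
        for p in self._positions(x):
            self.bits[p] = True

    def __contains__(self, x):         # O(k) membership query
        return all(self.bits[p] for p in self._positions(x))
```

Usage: `bf = BloomFilter(1000, 3); bf.add("s1")`; then `"s1" in bf` is guaranteed True (no false negatives), while an element never added returns True only in the false-positive case analyzed next.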
Algorithm to Approximate Set Membership Query
False positive!
With s1 (bits 1, 4, 6) and s2 (bits 4, 7, 9) inserted, query an x not in S whose hashes are h_0(x) = 9, h_1(x) = 6, h_2(x) = 1. All three probed bits are true, so the filter answers True even though x was never inserted.
Error Types
• False negative: answering “is not there” for an element which is there
  • Never happens for a Bloom filter
• False positive: answering “is there” for an element which is not there
  • Might happen. How likely?
Probability of false positives
n = size of table, m = number of items, k = number of hash functions
Consider a particular bit j, 0 ≤ j ≤ n−1.
Probability that h_i(x) does not set bit j after hashing only 1 item:
P(h_i(x) ≠ j) = 1 − 1/n
Probability that h_i does not set bit j after hashing m items:
P(for all x in {s_1, s_2, …, s_m}: h_i(x) ≠ j) = (1 − 1/n)^m
Probability of false positives
Probability that none of the k hash functions sets bit j after hashing m items:
P(for all x in {s_1, …, s_m}, for all i in {1, …, k}: h_i(x) ≠ j) = (1 − 1/n)^(km)
We know that (1 − 1/n)^n ≈ 1/e = e^(−1), so
(1 − 1/n)^(km) = ((1 − 1/n)^n)^(km/n) ≈ e^(−km/n)
Probability of false positives
So a particular bit is still false with probability ≈ e^(−km/n), and true with probability ≈ 1 − e^(−km/n).
A false positive occurs when all k probed bits are true:
P(false positive) ≈ (1 − e^(−km/n))^k
For fixed m and n, which value of k minimizes this bound? k_opt = ln 2 · (n/m), where n/m is the number of bits per item.
With k = k_opt, the probability of a false positive is (1/2)^(k_opt) ≈ (0.6185)^(n/m).
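The bound is easy to evaluate numerically (illustrative values: n = 1000 bits for m = 100 items, i.e., 10 bits per item):

```python
import math

# false-positive probability bound: (1 - e^(-km/n))^k
def fp_probability(n, m, k):
    return (1 - math.exp(-k * m / n)) ** k

n, m = 1000, 100                      # 10 bits per item
k_opt = round(math.log(2) * n / m)    # optimal number of hash functions
print(k_opt)                          # → 7
print(round(fp_probability(n, m, k_opt), 4))  # → 0.0082, matching 0.6185^10
```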
Bloom Filters: cons
• False positives, at a rate that grows as more items are inserted
• A standard Bloom filter supports no deletions and cannot enumerate or count its members
References
• https://en.wikipedia.org/wiki/Bloom_filter
• Graham Cormode, Sketch Techniques for Approximate Query Processing, AT&T Research
• Michael Mitzenmacher, Compressed Bloom Filters, Harvard University, Cambridge
Count-Min Sketch
Erick Purwanto
A0050717L
Motivation Count-Min Sketch
• Implemented in real systems
  – AT&T: network switches analyze network traffic using limited memory
  – Google: implemented on top of the MapReduce parallel processing infrastructure
• Structure (from the slide’s figure): CM is a d × w array of counters; each row i has a hash function h_i : [1, m] → [1, w]
• Update: when item j arrives in the stream, increment CM[i, h_i(j)] by 1 for every row i = 1, …, d
Count-Min Sketch
• Algorithm to estimate a frequency query:
  – Count(j): f̃_j = min_i CM[i, h_i(j)]
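The update and query rules can be sketched as a minimal class (the pairwise-independent family (a·x + b mod p) mod w is an illustrative choice):

```python
import random

class CountMin:
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.p = 2_147_483_647                      # Mersenne prime 2^31 - 1
        self.w, self.d = w, d
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(d)]
        self.CM = [[0] * w for _ in range(d)]       # d x w counter array

    def _h(self, i, x):
        a, b = self.hashes[i]
        return ((a * x + b) % self.p) % self.w

    def update(self, x, c=1):
        for i in range(self.d):                     # one counter per row
            self.CM[i][self._h(i, x)] += c

    def count(self, x):                             # min over rows
        return min(self.CM[i][self._h(i, x)] for i in range(self.d))
```

Note that `count` never underestimates: every counter it inspects holds f_x plus any colliding mass, so the minimum is still ≥ f_x.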
Collision
• Entry CM[i, h_i(j)] is an estimate of the frequency of item j at row i
  – for example, h_1(5) = h_1(2) = 7: in the stream 3 5 5 8 5 2 5, items 5 and 2 share a counter in row 1
• CM[i, h_i(j)] = f_j + Σ_{k ≠ j, h_i(k) = h_i(j)} f_k = f_j + X_{i,j}
Count-Min Sketch Analysis
• Let ε be the approximation error, and set w = e/ε
• The expectation of the other items’ contribution:
  E[X_{i,j}] = Σ_{k≠j} f_k · Pr[h_i(k) = h_i(j)]
            ≤ Pr[h_i(k) = h_i(j)] · Σ_k f_k
            = (1/w) · F_1
            = (ε/e) · F_1
Count-Min Sketch Analysis
• Markov inequality: Pr[X ≥ k · E[X]] ≤ 1/k
• For one row: Pr[X_{i,j} ≥ ε · F_1] = Pr[X_{i,j} ≥ e · E[X_{i,j}]] ≤ 1/e
• The min over rows fails only if every one of the d independent rows fails:
  Pr[f̃_j > f_j + ε · F_1] = (Pr[f̃_{i,j} > f_j + ε · F_1])^d ≤ (1/e)^(ln(1/δ)) = δ
Count-Min Sketch
• Result
  – a dynamic data structure CM answering item frequency queries
  – set w = e/ε and d = ln(1/δ)
  – with probability at least 1 − δ: f̃_j ≤ f_j + ε · Σ_k f_k
Heavy Hitters Problem
• Objective:
  – find all items that occur more than n/k times in the array
    • there can be at most k such items
• Parameter: k
Heavy Hitters Problem: Naïve Solution
• Trivial solution is to use an O(m) array
  1. Store all items and each item’s frequency
  2. Find all items with frequency ≥ n/k
ε-Heavy Hitters Problem (ε-HH)
• Relax the Heavy Hitters Problem: frequencies only need to be correct up to an additive error, so a Count-Min sketch can be used in place of exact counts
Naïve Solution using CMS
• Query the frequency of all m items
  – return items with Count(j) ≥ n/k
• O(md) time
  – slow
Better Solution
• Use a CMS to store the frequencies, plus a min-heap of current heavy-hitter candidates
• After the i-th item arrives, the running threshold is b = i/k; a new item whose estimated count reaches b enters the heap, and heap entries whose counts fall below b are evicted
EXAMPLES
Stream: 4 2 6 9 3 4 …, with k = 5. Heap entries are written {estimated count : item}; after i items the threshold is b = i/k.
• i = 1: item 4 arrives; b = 1/5; heap = {1:4}
• i = 5: stream so far is 4 2 6 9 3; b = 5/5 = 1; every item has count 1: heap = {1:4}, {1:2}, {1:6}, {1:9}, {1:3}
• i = 6: item 4 arrives again; b = 6/5; its counters rise and its estimate becomes 2; the count-1 entries fall below b and are evicted, leaving heap = {2:4}
• … (items continue to arrive) …
• i = 79: item 2 arrives; b = 79/5 = 15.8; its counters rise from (16, 18, 15) to (17, 19, 16), so its estimate is 16; heap = {16:2}, {16:4}, {20:9}, {23:6}
• i = 80: item 1 arrives; b = 80/5 = 16; its counters (≈ 3, 4) stay far below b, so it never enters the heap
• i = 81: item 9 arrives; b = 81/5 = 16.2; its counters rise from (20, 24, 25) to (21, 25, 26), so its heap entry is updated to {21:9}; the entries {16:2} and {16:4} now fall below b and are evicted, leaving heap = {21:9}, {23:6}
Analysis
• Because n is unknown, the possible heavy hitters are recalculated and stored each time a new item comes in
• Maintaining the heap requires extra O(log k) = O(log(1/ε)) time per item
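The CMS-plus-min-heap procedure from the preceding slides can be sketched as follows (a compact restatement of the earlier CMS is included so the example is self-contained; the parameter defaults and the eviction by stale stored estimates are simplifications for illustration):

```python
import heapq
import random

class CountMin:
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.p, self.w, self.d = 2_147_483_647, w, d
        self.hashes = [(rng.randrange(1, self.p), rng.randrange(self.p))
                       for _ in range(d)]
        self.CM = [[0] * w for _ in range(d)]

    def _h(self, i, x):
        a, b = self.hashes[i]
        return ((a * x + b) % self.p) % self.w

    def update(self, x):
        for i in range(self.d):
            self.CM[i][self._h(i, x)] += 1

    def count(self, x):
        return min(self.CM[i][self._h(i, x)] for i in range(self.d))

def heavy_hitters(stream, k, w=271, d=5):
    cm = CountMin(w, d)
    heap, members = [], set()          # min-heap of (estimated count, item)
    for i, x in enumerate(stream, 1):
        cm.update(x)
        est = cm.count(x)
        if est >= i / k:               # candidate reaches the threshold b = i/k
            if x in members:           # refresh the stale entry for x
                heap = [(c, j) for c, j in heap if j != x]
                heapq.heapify(heap)
            heapq.heappush(heap, (est, x))
            members.add(x)
        # evict candidates whose stored estimate fell below the threshold
        while heap and heap[0][0] < i / k:
            members.discard(heapq.heappop(heap)[1])
    return members
```

On a stream where one item dominates, only that item survives the rising threshold.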
AMS Sketch: Estimating the Second Moment
Dissanayaka Mudiyanselage Emil Manupa Karunaratne
The Second Moment
• Stream: items from a domain of size n, item i appearing f_i times
• The second moment: F_2 = Σ_i f_i²
• The trivial solution would be: maintain a histogram of size n and return the sum of squares
• It is not feasible to maintain that large an array, so we look for an approximation algorithm with sub-linear space complexity and bounded errors
• The algorithm gives an estimate within ε relative error with δ failure probability (two parameters)
The Method
• Keep a d × w array of counters, with d = 8 log(1/δ) rows
• Update: when item j arrives, add g_i(j) ∈ {−1, +1} to the counter at position h_i(j) in every row i
• Each row yields an unbiased estimate of F_2 (the sum of its squared counters)
• Still, a single row’s failure probability decreases only linearly in 1/w
What guarantee can we give about the accuracy?
• We had d hash functions, producing estimates R_1, R_2, …, R_d
• Take the median: the median is wrong only if half of the estimates are wrong
• These are d independent estimates, like coin tosses, with exponentially decaying probability of many landing badly together
• Chernoff bounds make this precise:
  – each estimate succeeds with probability ≥ 3/4, so μ = (3/4)·d estimates succeed in expectation
  – the median fails only when the number of successes drops to d/2, i.e., d/4 below its mean
  – the probability of such a deviation is at most δ when d = 8 log(1/δ)
Space and Time Complexity
• E.g., to achieve failure probability δ = e^(−10), only 8 · 10 = 80 rows are required
• Space complexity is O(log(1/δ)) rows of w counters each
• Time complexity will be explained later along with the applications
AMS Sketch and Applications
Sapumal Ahangama
Hash functions
• h_k maps the input domain uniformly to the buckets {1, 2, …, w}
• h_k should be a pairwise independent hash function, to cancel out product terms
  – Ex: the family h(x) = (a·x + b mod p) mod w
  – for a and b chosen from the prime field of p, with a ≠ 0
Hash functions
• g_k maps elements from the domain uniformly onto {−1, +1}
• g_k should be four-wise independent
  – Ex: g(x) = 2·((a·x³ + b·x² + c·x + d mod p) mod 2) − 1
  – for a, b, c, d chosen uniformly from the prime field of p
Hash functions
• These hash functions can be computed very quickly, faster even than more familiar (cryptographic) hash functions
• For scenarios that require very high throughput, efficient implementations are available,
  – based on optimizations for particular values of p, and partial precomputations
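An AMS sketch built from exactly these two hash families can be sketched as follows (the estimate routine — sum of squared counters per row, median across rows — follows the earlier slides; class and parameter names are illustrative):

```python
import random

class AMSSketch:
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.p = 2_147_483_647               # prime field modulus
        self.w, self.d = w, d
        # h(x) = ((a*x + b) mod p) mod w, pairwise independent
        self.h_params = [(rng.randrange(1, self.p), rng.randrange(self.p))
                         for _ in range(d)]
        # g(x) = 2*(((a*x^3 + b*x^2 + c*x + d) mod p) mod 2) - 1, four-wise
        self.g_params = [tuple(rng.randrange(self.p) for _ in range(4))
                         for _ in range(d)]
        self.CM = [[0] * w for _ in range(d)]

    def _h(self, i, x):
        a, b = self.h_params[i]
        return ((a * x + b) % self.p) % self.w

    def _g(self, i, x):
        a, b, c, e = self.g_params[i]
        return 2 * (((a * x**3 + b * x**2 + c * x + e) % self.p) % 2) - 1

    def update(self, x, c=1):
        for i in range(self.d):
            self.CM[i][self._h(i, x)] += c * self._g(i, x)

    def estimate_f2(self):
        # each row's sum of squared counters is an unbiased F2 estimate;
        # the median over rows gives the high-probability guarantee
        rows = [sum(v * v for v in row) for row in self.CM]
        return sorted(rows)[len(rows) // 2]
```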
Applications - Inner product
• The AMS sketch can be used to estimate the inner product between a pair of vectors
• Given two frequency distributions f and f′:
  f · f′ = Σ_{i=1}^{M} f(i) · f′(i)
• The AMS-sketch-based estimator is an unbiased estimator for the inner product of the vectors
Inner Product
• Two sketches CM and CM′
• Formed with the same parameters and using the same hash functions (same w, d, h_k, g_k)
• The row estimate is the inner product of the rows:
  Σ_{i=1}^{w} CM[k, i] · CM′[k, i]
Inner Product
• Expanding Σ_{i=1}^{w} CM[k, i] · CM′[k, i] shows that the estimate gives f · f′ with additional cross-terms due to collisions of items under h_k
• The expectation of these cross-terms is zero
  – over the choice of the hash functions, as the function g_k is equally likely to add as to subtract any given term
Inner Product – Join size estimation
• The inner product has a natural interpretation as the size of the equi-join between two relations
• In SQL:
  SELECT COUNT(*) FROM D, D' WHERE D.id = D'.id
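Join-size estimation with two sketches sharing the same hash functions can be sketched as follows (a compact restatement of the sketch is included so the example is self-contained; the data is illustrative):

```python
import random

class Sketch:
    def __init__(self, w, d, seed=0):       # same seed => same h_k and g_k
        rng = random.Random(seed)
        self.p, self.w, self.d = 2_147_483_647, w, d
        self.h = [(rng.randrange(1, self.p), rng.randrange(self.p))
                  for _ in range(d)]
        self.g = [tuple(rng.randrange(self.p) for _ in range(4))
                  for _ in range(d)]
        self.CM = [[0] * w for _ in range(d)]

    def update(self, x, c=1):
        for i in range(self.d):
            a, b = self.h[i]
            aa, bb, cc, dd = self.g[i]
            col = ((a * x + b) % self.p) % self.w
            sign = 2 * (((aa * x**3 + bb * x**2 + cc * x + dd) % self.p) % 2) - 1
            self.CM[i][col] += c * sign

def inner_product(s1, s2):
    # row estimate = inner product of the rows; take the median over rows
    rows = [sum(a * b for a, b in zip(r1, r2))
            for r1, r2 in zip(s1.CM, s2.CM)]
    return sorted(rows)[len(rows) // 2]

# SELECT COUNT(*) FROM D, D' WHERE D.id = D'.id
D1 = [1, 1, 2, 3, 3, 3]
D2 = [1, 3, 3, 4]
s1, s2 = Sketch(512, 7), Sketch(512, 7)     # same hash functions
for x in D1: s1.update(x)
for x in D2: s2.update(x)
print(inner_product(s1, s2))   # true join size: 2*1 + 3*2 = 8
```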
Example
UPDATE(23, 1), with d = 3, w = 8, all counters initially 0.
Hashes for item 23: h1 = 3, h2 = 1, h3 = 7; signs g1 = −1, g2 = −1, g3 = +1. After the update:
row 1:  0  0 −1  0  0  0  0  0
row 2: −1  0  0  0  0  0  0  0
row 3:  0  0  0  0  0  0 +1  0
UPDATE(99, 2). Hashes for item 99: h1 = 5, h2 = 1, h3 = 3; signs g1 = +1, g2 = −1, g3 = +1. Each hit counter changes by g_i · 2:
row 1:  0  0 −1  0 +2  0  0  0
row 2: −3  0  0  0  0  0  0  0
row 3:  0  0 +2  0  0  0 +1  0
(columns 1–8)