Streaming Algorithms: CS6234 Advanced Algorithms February 10 2015
The stream model
• Data sequentially enters at a rapid rate from one or more inputs
• Processing in real-time
Overview
Counting bits with DGIM algorithm
Bloom Filter
Count-Min Sketch
AMS Sketch
Counting bits with DGIM algorithm
Presented by
Dmitrii Kharkovskii
Sliding windows
• A useful model: queries are about a window of length N, the N most recent elements
• The window may be too large to store, or there are so many streams that windows for all of them cannot be stored
Problem description
• Problem
• Given a stream of 0’s and 1’s
• Answer queries of the form “how many 1’s in the last k bits?” where k ≤ N
• Obvious solution
• Store the most recent N bits (i.e., window size = N)
• When a new bit arrives, discard the (N+1)-st bit
• Real Problem
• Slow: need to scan k bits to count
• What if we cannot afford to store N bits?
• Estimate with an approximate answer
Datar-Gionis-Indyk-Motwani Algorithm (DGIM)
Overview
• Approximate answer
Main idea of the algorithm
Timestamps
• Each bit in the stream has a timestamp - the position in the stream from the
beginning.
• Store the most recent timestamp to identify the position of any other bit in the
window
Buckets
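• Each bucket has two components:
• Timestamp of its most recent (right) end
• Size: the number of 1's in the bucket, which is always a power of two, $2^j$.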
Representing the stream by buckets
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• Buckets do not overlap.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in time).
Updating buckets when a new bit arrives
• Drop the last (oldest) bucket if it has no overlap with the window
• If the new bit is 0, nothing else changes
• If the new bit is 1, create a new bucket for it. Size = 1, timestamp = current time modulo N.
• If there are 3 buckets of size 1, merge the two oldest into one of size 2.
• If there are 3 buckets of size 2, merge the two oldest into one of size 4.
• ...
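The update rules above, together with the usual DGIM query rule (sum the buckets that overlap the last k bits, counting only half of the oldest one), can be sketched in Python. This is a minimal illustration, not the presenter's code: the class name `DGIM`, its method names, and the use of absolute timestamps instead of timestamps modulo N are assumptions made for readability.

```python
class DGIM:
    """Approximate count of 1s in the last N bits of a 0/1 stream (sketch)."""

    def __init__(self, window_size):
        self.N = window_size
        self.time = 0        # position of the latest bit (absolute, not mod N)
        self.buckets = []    # (end_timestamp, size) pairs, newest first

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its right end has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        # Restore "at most two buckets of each size", smallest size first.
        i = 0
        while i + 2 < len(self.buckets) and \
                self.buckets[i][1] == self.buckets[i + 2][1]:
            # Buckets i, i+1, i+2 share a size: merge the two oldest.
            newer_end, size = self.buckets[i + 1]
            del self.buckets[i + 2]
            self.buckets[i + 1] = (newer_end, 2 * size)
            i += 1

    def count(self, k=None):
        """Estimate the number of 1s among the last k bits (k <= N)."""
        k = self.N if k is None else k
        sizes = [size for end, size in self.buckets if end > self.time - k]
        if not sizes:
            return 0
        # Newer buckets count in full; the oldest counts half, rounded up
        # so that a size-1 bucket (whose single 1 is in range) counts as 1.
        return sum(sizes[:-1]) + (sizes[-1] + 1) // 2
```

Usage: feed bits with `add(bit)` and call `count(k)` at any time. Only the bucket list is stored, never the raw window.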
Example of updating process
Query Answering
How many ones are in the most recent k bits?
Memory requirements
Performance guarantee
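• Suppose the last (oldest) bucket has size $2^r$.
• Counting half of it, $2^{r-1}$, as lying inside the window gives an error of at most $2^{r-1}$.
• There is at least one bucket of each smaller size, and the right end of the last bucket is a 1 inside the window, so the true count is at least $1 + (1 + 2 + \dots + 2^{r-1}) = 2^r$.
• Thus the error is at most 50%.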
Bloom Filter
Presented by
Naheed Anjum Arafat
Motivation:
The “Set Membership” Problem
• x: An element
• S: A finite set of elements
• Input: x, S
• Output:
• True (if x in S)
• False (if x not in S)
Streaming Algorithm:
• Limited space per item
• Limited processing time per item
• Approximate answer based on a summary/sketch of the data stream in memory.
Bloom Filter
• Consists of
• a vector of n Boolean values, initially all set to false (space complexity: O(n))
• k independent and uniform hash functions $h_0, h_1, \dots, h_{k-1}$, each of which outputs a value within the range {0, 1, … , n-1}
Example with n = 10:
F F F F F F F F F F
0 1 2 3 4 5 6 7 8 9
Bloom Filter
• For each element s ∈ S, the Boolean values at positions $h_0(s), h_1(s), \dots, h_{k-1}(s)$ are set to true.
• Complexity of insertion: O(k)
Example with k = 3: $s_1$ hashes to positions 1, 4 and 6.
F T F F T F T F F F
0 1 2 3 4 5 6 7 8 9
Bloom Filter
• Note: a particular Boolean value may be set to true several times.
Example with k = 3: $s_2$ hashes to positions 4, 7 and 9 (position 4 was already set by $s_1$).
F T F F T F T T F T
0 1 2 3 4 5 6 7 8 9
Algorithm to Approximate Set Membership Query
Input: x (may or may not be an element of S)
Output: Boolean
Runtime complexity: O(k)
For all i ∈ {0, 1, …, k-1}
    if the Boolean value at position $h_i(x)$ is false
        return False
return True
Filter state after inserting $s_1$ and $s_2$ (k = 3):
F T F F T F T T F T
0 1 2 3 4 5 6 7 8 9
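The insertion and query procedures can be sketched in Python as follows. The class name `BloomFilter` and the way the k hash functions are derived (SHA-256 of the index i joined with the item, reduced modulo n) are assumptions chosen for illustration; the slides only require k independent, uniform hash functions.

```python
import hashlib


class BloomFilter:
    """Bloom filter over n Boolean cells with k hash functions (sketch)."""

    def __init__(self, n, k):
        self.n = n
        self.k = k
        self.bits = [False] * n

    def _positions(self, item):
        # Hypothetical hash family: h_i(item) = SHA-256("i:item") mod n.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, item):
        # Insertion: set the k cells h_0(item), ..., h_{k-1}(item) to true.
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # Query: answer True only if all k cells are true.
        return all(self.bits[pos] for pos in self._positions(item))
```

Usage: after `bf = BloomFilter(n=1000, k=3)` and `bf.add("apple")`, the test `"apple" in bf` is always True, while a test for an item that was never added is usually False but may be a false positive.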
Algorithm to Approximate Set Membership Query
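False positive!
Example with k = 3: $s_1$ set positions 1, 4 and 6, and $s_2$ set positions 4, 7 and 9. An element x that was never inserted hashes to positions 6, 1 and 9. All three are already true, so the query answers True.
F T F F T F T T F T
0 1 2 3 4 5 6 7 8 9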
Error Types
• False Negative – Answering “is not there” on an element which “is there”
• Never happens for Bloom Filter
• False Positive – Answering “is there” for an element which “is not there”
• May happen. How likely?
Probability of false positives
n = size of table
m = number of items
k = number of hash functions
Consider a particular bit j, 0 ≤ j ≤ n-1.
Probability that a hash function $h_i$ does not set bit j after hashing only 1 item: $1 - \frac{1}{n}$
Probability that $h_i$ does not set bit j after hashing m items: $\left(1 - \frac{1}{n}\right)^m$
Probability that none of the k hash functions sets bit j after hashing m items: $\left(1 - \frac{1}{n}\right)^{km}$
We know that $\left(1 - \frac{1}{n}\right)^n \approx \frac{1}{e} = e^{-1}$, so $\left(1 - \frac{1}{n}\right)^{km} \approx e^{-km/n}$
Probability that bit j is set after hashing m items: $1 - e^{-km/n}$
Approximate probability of a false positive (all k probed bits are set): $\left(1 - e^{-km/n}\right)^k$
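To get a feel for the numbers, the false-positive estimate can be evaluated directly. The sketch below assumes the standard form of this derivation, $(1 - (1 - 1/n)^{km})^k \approx (1 - e^{-km/n})^k$, with n, m and k as defined on the slide; the function names are hypothetical.

```python
import math


def false_positive_rate(n, m, k):
    """(1 - (1 - 1/n)^(k*m))^k: chance that all k probed bits are set."""
    return (1.0 - (1.0 - 1.0 / n) ** (k * m)) ** k


def false_positive_rate_approx(n, m, k):
    """Same quantity using the approximation (1 - 1/n)^n ~ 1/e."""
    return (1.0 - math.exp(-k * m / n)) ** k


if __name__ == "__main__":
    # Table sizes are illustrative; m = 1000 items and k = 7 hash functions.
    for n in (5_000, 10_000, 20_000):
        print(n, false_positive_rate(n, 1000, 7),
              false_positive_rate_approx(n, 1000, 7))
```

For a fixed number of items m, enlarging the table n lowers the estimate, which is the trade-off between space and error that the slides describe.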
References
• Wikipedia, Bloom filter, https://en.wikipedia.org/wiki/Bloom_filter
• Graham Cormode, Sketch Techniques for Approximate Query Processing, AT&T Research
• Michael Mitzenmacher, Compressed Bloom Filters, Harvard University, Cambridge
Count-Min Sketch
Erick Purwanto
A0050717L
Motivation
• Implemented in real systems
– AT&T: network switches, to analyze network traffic using limited memory
– Google: implemented on top of the MapReduce parallel processing infrastructure
• Simple and used to solve other problems
– Heavy Hitters by Joseph
– Second Moment, AMS Sketch by Manupa
– Inner Product, Self Join by Sapumal
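Frequency Query
• Given a stream of data, a vector $x$ of length $n$ with items $x_t \in [1, m]$, and
• update (increment) operation,
– we want to know at each time, what is the frequency $f_j$ of item $j$
– assume frequency $f_j \ge 0$
Figure: when an item $j$ of the stream arrives, each hash function $h_i : [1, m] \to [1, w]$, $i = 1, \dots, d$, selects cell $h_i(j)$ in row $i$ of the $d \times w$ array CM, and that cell is incremented by 1.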
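Count-Min Sketch
• Algorithm to estimate Frequency Query:
– Count($j$): $\hat{f}_j = \min_{1 \le i \le d} \mathrm{CM}[i, h_i(j)]$
Figure: item $j$ is hashed by $h_1, \dots, h_d$ to one cell in each of the $d$ rows of the $d \times w$ array CM; the estimate is the smallest of these $d$ cells.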
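The update and Count operations can be sketched in Python as below. The class and method names are hypothetical, items are assumed to be non-negative integers, and the hash functions are assumed to come from the family $((a \cdot j + b) \bmod p) \bmod w$ with a fixed large prime p. The helper `from_error` assumes the common sizing $w = \lceil e/\varepsilon \rceil$, $d = \lceil \ln(1/\delta) \rceil$.

```python
import math
import random


class CountMinSketch:
    """d x w array of counters with one hash function per row (sketch)."""

    PRIME = (1 << 61) - 1  # assumed to exceed every item id

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        # Row i uses h_i(j) = ((a_i * j + b_i) mod PRIME) mod w.
        self.coeffs = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(d)]

    @classmethod
    def from_error(cls, eps, delta, seed=0):
        # Assumed sizing: w = ceil(e / eps), d = ceil(ln(1 / delta)).
        return cls(math.ceil(math.e / eps), math.ceil(math.log(1 / delta)), seed)

    def _cell(self, i, j):
        a, b = self.coeffs[i]
        return ((a * j + b) % self.PRIME) % self.w

    def inc(self, j, amount=1):
        # Update: add to one cell in every row.
        for i in range(self.d):
            self.table[i][self._cell(i, j)] += amount

    def count(self, j):
        # Query: the minimum over the d cells that item j maps to.
        return min(self.table[i][self._cell(i, j)] for i in range(self.d))
```

Because every cell that item j maps to also collects the counts of colliding items, `count(j)` can overestimate the true frequency but, for increment-only streams, never underestimates it.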
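Collision
• Entry $\mathrm{CM}[i, h_i(j)]$ is an estimate of the frequency of item $j$ at row $i$
– for example, in the stream $x = \dots, 3, 5, 5, 8, 5, 2, 5$ every item hashed to cell 7 of row $i$ adds to the same counter
• Let $f_j$: frequency of $j$, and random variable $X_{i,j}$: frequency of all $k \ne j$ with $h_i(k) = h_i(j)$
Count-Min Sketch Analysis
• Estimate frequency of $j$ at row $i$: $\hat{f}_{i,j} = \mathrm{CM}[i, h_i(j)] = f_j + X_{i,j}$
• Let $\varepsilon$: approximation error, and set $w = e/\varepsilon$
• The expectation of other item contribution, with $\|f\|_1 = \sum_k f_k$: $E[X_{i,j}] = \sum_{k \ne j} f_k \cdot \Pr[h_i(k) = h_i(j)] \le \frac{1}{w} \|f\|_1 = \frac{\varepsilon}{e} \|f\|_1$
• Markov Inequality: $\Pr[X \ge a] \le E[X]/a$ for a non-negative random variable $X$ and $a > 0$
• Probability an estimate far from true value: $\Pr[\hat{f}_{i,j} > f_j + \varepsilon \|f\|_1] = \Pr[X_{i,j} > \varepsilon \|f\|_1] \le \frac{E[X_{i,j}]}{\varepsilon \|f\|_1} \le \frac{1}{e}$
• Let $\delta$: failure probability, and set $d = \ln(1/\delta)$
• Probability final estimate far from true value: $\Pr[\hat{f}_j > f_j + \varepsilon \|f\|_1] = \Pr[\forall i: \hat{f}_{i,j} > f_j + \varepsilon \|f\|_1] \le (1/e)^d = \delta$
Count-Min Sketch
• Result
– dynamic data structure CM, item frequency query
– set $w = e/\varepsilon$ and $d = \ln(1/\delta)$
– with probability at least $1 - \delta$, $f_j \le \hat{f}_j \le f_j + \varepsilon \|f\|_1$
– an update or a query takes $O(d)$ time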
Heavy Hitters Problem
• Objective:
– Find all items that occur more than $n/k$ times in the array
• there can be at most $k$ such items
• Parameter
– $k$
Heavy Hitters Problem: Naïve Solution
• Trivial solution is to use an array
1. Store all items and each item’s frequency
2. Find all items that have frequency more than $n/k$
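$\varepsilon$-Heavy Hitters Problem ($\varepsilon$-HH)
• Relax Heavy Hitters Problem
Figure: an item $j$ of the stream is hashed by $h_1, h_2, \dots, h_d$ into the $d \times w$ count-min sketch.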
Naïve Solution using CMS
• Query the frequency of all items
– Return items whose estimated frequency exceeds the heavy-hitter threshold
– slow
Better Solution
• Use CMS to store the frequency
Figure: the $d \times w$ count-min sketch with hash functions $h_1, h_2, \dots, h_d$.
EXAMPLES
Min-heap entries are written {count:item}.
• Items 1 to 5 of the stream are 4, 2, 6, 9, 3. Each is counted once in the CMS, and the min-heap holds {1:4}, {1:2}, {1:6}, {1:9}, {1:3}.
• Item 6 is 4 again. Its CMS counters rise from 1 to 2, and the min-heap shrinks to the single entry {2:4}.
• Item 79 is 2. The min-heap holds {16:4}, {20:9}, {23:6}. The CMS counters of 2 rise from 16, 18, 15 to 17, 19, 16, so its estimate is 16 and {16:2} is added to the min-heap.
• Item 80 is 1. Its CMS counters are small (3 and 4 in the figure) and the min-heap is unchanged.
• Item 81 is 9. Its CMS counters rise from 20, 24, 25 to 21, 25, 26, and its heap entry becomes {21:9}. The entries {16:2} and {16:4} are then removed, leaving {21:9} and {23:6}.
Analysis
• Because $n$ is unknown in advance, possible heavy hitters are recalculated and stored every time a new item comes in
• Maintaining the heap requires extra time
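The combination of a count-min sketch and a min-heap can be sketched as follows. This is an illustration under stated assumptions, not the presenter's code: the threshold is taken to be (items seen so far) / k, the class and method names are hypothetical, items are non-negative integers, and outdated heap entries are skipped lazily instead of being updated in place.

```python
import heapq
import random


class HeavyHitters:
    """Track candidate heavy hitters with a count-min sketch and a min-heap."""

    PRIME = (1 << 61) - 1  # assumed to exceed every item id

    def __init__(self, k, w, d, seed=0):
        rng = random.Random(seed)
        self.k, self.w, self.d = k, w, d
        self.n = 0                       # number of items seen so far
        self.table = [[0] * w for _ in range(d)]
        self.coeffs = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                       for _ in range(d)]
        self.best = {}                   # item -> estimate stored in the heap
        self.heap = []                   # (estimate, item), may hold stale pairs

    def _cells(self, j):
        for i, (a, b) in enumerate(self.coeffs):
            yield i, ((a * j + b) % self.PRIME) % self.w

    def add(self, j):
        self.n += 1
        for i, c in self._cells(j):
            self.table[i][c] += 1
        estimate = min(self.table[i][c] for i, c in self._cells(j))
        threshold = self.n / self.k
        if estimate >= threshold:
            self.best[j] = estimate
            heapq.heappush(self.heap, (estimate, j))
        # Remove stale entries and entries that fell below the threshold.
        while self.heap:
            est, item = self.heap[0]
            if self.best.get(item) != est:
                heapq.heappop(self.heap)         # outdated copy
            elif est < threshold:
                heapq.heappop(self.heap)
                del self.best[item]
            else:
                break

    def candidates(self):
        return dict(self.best)
```

Usage: call `add(j)` for every arriving item and read `candidates()` at any time. Since the sketch never underestimates, an item whose true count reaches the threshold is always reported; items reported in error are possible when counters collide.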
AMS Sketch: Estimate Second Moment
Dissanayaka Mudiyanselage Emil Manupa Karunaratne
The Second Moment
• Stream: a sequence of items $j \in \{1, \dots, n\}$, where item $j$ occurs $f_j$ times
• The Second Moment: $F_2 = \sum_{j=1}^{n} f_j^2$
• The trivial solution would be: maintain a histogram of size n and compute the sum of squares
• It is not feasible to maintain such a large array, so we look for an approximation algorithm that achieves sub-linear space complexity with bounded error
• The algorithm gives an estimate within relative error ε, with failure probability δ (two parameters).
The Method
Figure: when item $j$ arrives, row $k$ of the sketch adds $g_k(j)$ to one of its counters, for each of the $d = 8 \log(1/\delta)$ rows.
Space and Time Complexity
• E.g., to achieve a failure probability of $\delta = e^{-10}$, only 8 * 10 = 80 rows are required
• Space complexity is $O(\log(1/\delta))$.
• Time complexity will be explained later along with the application
AMS Sketch and Applications
Sapumal Ahangama
Hash functions
• $h_k$ maps the input domain uniformly to the $w$ buckets
• $h_k$ should be drawn from a pairwise independent family of hash functions, to cancel out product terms
– Ex: family of functions $h(x) = ((ax + b) \bmod p) \bmod w$
– for $a$ and $b$ chosen from the prime field $\mathbb{Z}_p$, $a \ne 0$
Hash functions
• $g_k$ maps elements from the domain uniformly onto $\{-1, +1\}$
• $g_k$ should be four-wise independent
Applications - Inner product
• The AMS sketch can be used to estimate the inner product between a pair of vectors
• Given two frequency distributions $f$ and $f'$, the inner product is $f \cdot f' = \sum_j f_j f'_j$
Example
Sketch with d = 3 rows (hash functions h1, h2, h3) and w = 8 columns, initially all zero:
         1   2   3   4   5   6   7   8
row 1    0   0   0   0   0   0   0   0
row 2    0   0   0   0   0   0   0   0
row 3    0   0   0   0   0   0   0   0
After UPDATE(23, 1):
         1   2   3   4   5   6   7   8
row 1    0   0  -1   0   0   0   0   0
row 2   -1   0   0   0   0   0   0   0
row 3    0   0   0   0   0   0  +1   0
After UPDATE(99, 2):
         1   2   3   4   5   6   7   8
row 1    0   0  -1   0  +2   0   0   0
row 2   -3   0   0   0   0   0   0   0
row 3    0   0  +2   0   0   0  +1   0
In row 2 both items fall into column 1, so their signed contributions add up: -1 - 2 = -3.
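The UPDATE operation from the example, together with the second-moment and inner-product estimates, can be sketched in Python. Several details are assumptions rather than content of the slides: the estimates take the median over rows of the per-row sums, the bucket hashes $h_k$ use $((a \cdot j + b) \bmod p) \bmod w$, and the sign hashes $g_k$ use the parity of a random cubic polynomial modulo p as an approximately four-wise independent choice. All names are hypothetical.

```python
import random
from statistics import median


class AMSSketch:
    """d x w array of signed counters (sketch of the AMS update rule)."""

    P = (1 << 61) - 1  # assumed to exceed every item id

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        self.h = [(rng.randrange(1, self.P), rng.randrange(self.P))
                  for _ in range(d)]
        self.g = [[rng.randrange(self.P) for _ in range(4)] for _ in range(d)]

    def _bucket(self, k, j):
        a, b = self.h[k]
        return ((a * j + b) % self.P) % self.w

    def _sign(self, k, j):
        c0, c1, c2, c3 = self.g[k]
        value = (((c3 * j + c2) * j + c1) * j + c0) % self.P
        return 1 if value & 1 else -1

    def update(self, j, c=1):
        # UPDATE(j, c): row k adds c * g_k(j) to column h_k(j).
        for k in range(self.d):
            self.table[k][self._bucket(k, j)] += c * self._sign(k, j)

    def second_moment(self):
        # Each row's sum of squared counters estimates F2; take the median.
        return median(sum(v * v for v in row) for row in self.table)

    def inner_product(self, other):
        # Both sketches must share w, d and seed (identical hash functions).
        return median(sum(x * y for x, y in zip(r1, r2))
                      for r1, r2 in zip(self.table, other.table))
```

Usage: `s = AMSSketch(w=8, d=3)` followed by `s.update(23, 1)` and `s.update(99, 2)` mirrors the example, although the cells and signs chosen depend on the random hash functions and will generally differ from the figure.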