Streaming Algorithms: CS6234 Advanced Algorithms February 10 2015

The document summarizes algorithms for processing data streams with limited memory. It begins by introducing the stream model and objectives of computing functions over data streams using sublinear memory. It then overviews various streaming algorithms including the Count-Min Sketch, Bloom Filter, and AMS Sketch. The document specifically describes the Bloom Filter algorithm which uses hash functions to probabilistically determine set membership with false positives but no false negatives using sublinear space.


Streaming Algorithms

CS6234 Advanced Algorithms


February 10 2015

1
The stream model
• Data sequentially enters at a rapid rate from one or more inputs

• We cannot store the entire stream

• Processing in real-time

• Limited memory (usually sublinear in the size of the stream)

• Goal: Compute a function of the stream, e.g., median, number of
distinct elements, longest increasing subsequence

• An approximate answer is usually preferable

2
Overview
Counting bits with DGIM algorithm

Bloom Filter

Count-Min Sketch

Approximate Heavy Hitters

AMS Sketch

AMS Sketch Applications

3
Counting bits with DGIM
algorithm
Presented by
Dmitrii Kharkovskii

4
Sliding windows
• A useful model: queries are about a window of length N

• The N most recent elements received (or last N time units)

• Interesting case: N is still so large that it cannot be stored

• Or, there are so many streams that windows for all cannot be stored

5
Problem description
• Problem
• Given a stream of 0’s and 1’s
• Answer queries of the form “how many 1’s in the last k bits?” where k ≤ N
• Obvious solution
• Store the most recent N bits (i.e., window size = N)
• When a new bit arrives, discard the (N+1)st most recent bit
• Real Problem
• Slow: need to scan k bits to answer a query
• What if we cannot afford to store N bits?
• Estimate with an approximate answer
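The obvious solution above can be sketched in a few lines (an illustrative Python snippet, not part of the original slides):

```python
from collections import deque

class ExactWindowCounter:
    """The obvious solution: store the last N bits, using O(N) space."""
    def __init__(self, N):
        self.window = deque(maxlen=N)  # bits older than N fall off automatically

    def update(self, bit):
        self.window.append(bit)

    def count(self, k):
        # O(k) scan of the k most recent bits
        return sum(list(self.window)[-k:])
```

This is exact but pays O(N) space and O(k) query time, which motivates DGIM.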

6
Datar-Gionis-Indyk-Motwani Algorithm (DGIM)
 Overview

• Approximate answer

• Uses O(log² N) of memory

• Performance guarantee: error no more than 50%

• Possible to decrease the error to any fraction ε > 0 with O((1/ε) log² N) memory

• Possible to generalize to streams of positive integers

7
Main idea of the algorithm

Represent the window as a set of exponentially growing non-overlapping buckets

8
Timestamps
• Each bit in the stream has a timestamp - its position in the stream from the
beginning.

• Record timestamps modulo N (the window size) - uses O(log N) bits

• Store the most recent timestamp to identify the position of any other bit in the
window

9
Buckets
• Each bucket has two components:

• Timestamp of its most recent end. Needs O(log N) bits

• Size of the bucket - the number of ones in it.

• Size is always a power of 2, i.e., 2^j.

• To store the exponent j we need O(log log N) bits

• Each bucket needs O(log N) bits

10
Representing the stream by buckets
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• Buckets do not overlap.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in time).

11
Updating buckets when a new bit arrives
• Drop the oldest bucket if it no longer overlaps the window

• If the current bit is zero, no changes are needed

• If the current bit is one

• Create a new bucket with it. Size = 1, timestamp = current time modulo N.

• If there are 3 buckets of size 1, merge two oldest into one of size 2.

• If there are 3 buckets of size 2, merge two oldest into one of size 4.

• ...

12
Example of updating process

13
Query Answering
How many ones are in the most recent k bits?

• Find all buckets overlapping the last k bits

• Sum the sizes of all but the oldest one

• Add half of the size of the oldest one


Example: Ans = 1 + 1 + 2 + 4 + 4 + 8 + 8/2 = 24
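The update and query rules above can be sketched as follows (an illustrative Python implementation, not the lecture's code; for simplicity timestamps are kept absolute rather than modulo N):

```python
class DGIM:
    """DGIM over a window of N bits. Buckets are (timestamp, size), newest first."""
    def __init__(self, N):
        self.N = N          # window size
        self.t = 0          # current time
        self.buckets = []   # list of (timestamp, size), newest first

    def update(self, bit):
        self.t += 1
        # drop the oldest bucket once it slides out of the window
        if self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit == 1:
            self.buckets.insert(0, (self.t, 1))
            # merge whenever three buckets share a size: combine the two oldest
            i = 0
            while i + 2 < len(self.buckets):
                if self.buckets[i][1] == self.buckets[i + 2][1]:
                    ts = self.buckets[i + 1][0]       # more recent end survives
                    size = self.buckets[i + 1][1] * 2
                    self.buckets[i + 1:i + 3] = [(ts, size)]
                else:
                    i += 1

    def count(self, k):
        """Estimate the number of 1s among the last k bits (k <= N)."""
        total, oldest = 0, 0
        for ts, size in self.buckets:
            if ts > self.t - k:   # bucket overlaps the last k bits
                total += size
                oldest = size     # last matching bucket is the oldest one
        return total - oldest + oldest / 2
```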

14
Memory requirements
• O(log N) distinct bucket sizes, at most two buckets per size, and O(log N) bits
per bucket: O(log² N) bits in total

15
Performance guarantee
• Suppose the last (oldest) bucket has size 2^j.

• By taking half of it, the maximum error is 2^(j-1)

• There is at least one bucket of every size less than 2^j

• The true sum is at least 1 + 2 + 4 + … + 2^(j-1) = 2^j - 1

• The first bit of the last bucket is always equal to 1.

• Error is at most 50%

16
References

J. Leskovec, A. Rajaraman, J. Ullman. “Mining of Massive Datasets”.


Cambridge University Press

18
Bloom Filter

Presented by-
Naheed Anjum Arafat

19
Motivation:
The “Set Membership” Problem
• x: an element
• S: a finite set of elements

• Input: x, S
• Output:
• True (if x in S)
• False (if x not in S)

Streaming constraints:
• Limited space per item
• Limited processing time per item
• Approximate answer based on a summary/sketch of the data stream in memory

Solution (non-streaming): binary search on a sorted array of size |S|. Runtime complexity: O(log |S|)
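The baseline binary-search solution takes a few lines (an illustrative Python snippet, not from the slides):

```python
import bisect

def member(sorted_s, x):
    """Exact membership by binary search: O(log |S|) time, O(|S|) space."""
    i = bisect.bisect_left(sorted_s, x)
    return i < len(sorted_s) and sorted_s[i] == x
```

The Bloom filter trades this exactness for far less space.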

20
Bloom Filter

• Consists of
• a vector of n Boolean values, initially all set false (space: O(n))
• k independent and uniform hash functions h1, …, hk,
each outputting a value within the range {0, 1, … , n-1}

F F F F F F F F F F
0 1 2 3 4 5 6 7 8 9

n = 10
21
Bloom Filter
• For each element s ∈ S, the Boolean values at positions h1(s), h2(s), …, hk(s)
are set true.
• Complexity of insertion: O(k)

Example (k = 3): s1 hashes to positions 1, 4, and 6, so those entries become T:

F T F F T F T F F F
0 1 2 3 4 5 6 7 8 9

k = 3
22
Bloom Filter
• For each element s ∈ S, the Boolean values at positions h1(s), h2(s), …, hk(s)
are set true.
• Note: a particular Boolean value may be set to true several times.

Example (k = 3): s2 hashes to positions 4, 7, and 9; position 4 was already set by s1:

F T F F T F T T F T
0 1 2 3 4 5 6 7 8 9

k = 3
23
Algorithm to Approximate Set Membership Query
Input: x (may or may not be an element of S)
Output: Boolean
Runtime complexity: O(k)

For all i ∈ {1, 2, …, k}
    if B[hi(x)] is False
        return False
return True

Example (k = 3): every probed position of x is already set, so the query returns True:

F T F F T F T T F T
0 1 2 3 4 5 6 7 8 9
24
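The insertion and query procedures can be sketched together (an illustrative Python implementation; simulating the k independent hash functions with salted SHA-256 is an implementation choice, not from the slides):

```python
import hashlib

class BloomFilter:
    """Bloom filter with n bits and k hash functions (illustrative sketch)."""
    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [False] * n

    def _hashes(self, x):
        # derive k hash values by salting SHA-256 with the function index
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n

    def add(self, x):
        # O(k) insertion: set the k probed bits
        for pos in self._hashes(x):
            self.bits[pos] = True

    def __contains__(self, x):
        # O(k) membership query: all k probed bits must be set
        return all(self.bits[pos] for pos in self._hashes(x))
```

Added elements are always reported present (no false negatives); a never-added element is reported present only if all k of its bits happen to be set.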
Algorithm to Approximate Set Membership Query

False Positive!!

Example (k = 3): x was never inserted, but it hashes to positions 1, 6, and 9,
all of which were set by s1 and s2, so the filter wrongly answers True.

F T F F T F T T F T
0 1 2 3 4 5 6 7 8 9

k = 3
25
Error Types

• False Negative – Answering “is not there” on an element which “is there”
• Never happens for Bloom Filter

• False Positive – Answering “is there” for an element which “is not there”
• Might happen. How likely?

26
Probability of false positives
S1 S2

F T F T F F T F T F

n = size of table
m = number of items
k = number of hash functions

Consider a particular bit j, 0 ≤ j ≤ n−1
Probability that one hash function does not set bit j after hashing 1 item: 1 − 1/n
Probability that one hash function does not set bit j after hashing m items: (1 − 1/n)^m

27
Probability of false positives

n = size of table
m = number of items
k = number of hash functions

Probability that none of the k hash functions sets bit j after hashing m items: (1 − 1/n)^km

We know that (1 − 1/n)^n ≈ e^−1, so:
(1 − 1/n)^km ≈ e^−km/n

28
Probability of false positives

n = size of table
m = number of items
k = number of hash functions

Probability that bit j is not set: ≈ e^−km/n

The probability of having all k bits of a new element already set
(approximate probability of a false positive): (1 − e^−km/n)^k

For fixed m, n, which value of k will minimize this bound?  kopt = (n/m) ln 2

n/m = bits per item

With k = kopt, the probability of a false positive is ≈ (0.6185)^(n/m)
29
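A quick numeric check of the optimum (illustrative Python; n = 1000 and m = 100 are arbitrary example values, not from the slides):

```python
import math

# For fixed n, m the false-positive bound (1 - e^{-km/n})^k
# is minimized near k = (n/m) ln 2.
n, m = 1000, 100

def fp_bound(k):
    return (1 - math.exp(-k * m / n)) ** k

k_opt = (n / m) * math.log(2)            # continuous optimum, about 6.93
best = min(range(1, 20), key=fp_bound)   # integer k minimizing the bound
```

With 10 bits per item the best integer choice is k = 7, matching the rounded formula.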
Bloom Filters: cons

• Has a (small) false positive probability


• Cannot handle deletions
• The size of the bit vector has to be set a priori in order to maintain a
predetermined FP rate. Resolved in “Scalable Bloom Filters” –
Almeida, Paulo; Baquero, Carlos; Preguiça, Nuno; Hutchison, David (2007), "Scalable Bloom
Filters", Information Processing Letters 101 (6): 255–261

30
References

• https://en.wikipedia.org/wiki/Bloom_filter
• Graham Cormode, Sketch Techniques for Approximate Query Processing, AT&T Research
• Michael Mitzenmacher, Compressed Bloom Filters, Harvard University, Cambridge

31
Count-Min Sketch

Erick Purwanto
A0050717L
Motivation: Count-Min Sketch
• Implemented in real systems
– AT&T: network switches analyze network traffic using limited memory
– Google: implemented on top of the MapReduce parallel processing infrastructure
• Simple, and used to solve other problems
– Heavy Hitters by Joseph
– Second Moment, AMS Sketch by Manupa
– Inner Product, Self Join by Sapumal
Frequency Query
• Given a stream of updates (increments) to a frequency vector of length m
– we want to know, at each time, the frequency f_j of any item j
– assume f_j ≥ 0

• Trivial if we keep a count array of length m
– we want sublinear space

• Goal: probabilistically approximately correct answers
Count-Min Sketch
• Assumption:
– a family of pairwise-independent hash functions
– sample d hash functions h_i : [1, m] → [1, w]

• Use: d independent hash functions and a d × w integer array CM[ ]


Count-Min Sketch
• Algorithm to Update:
– Inc(j): for each row i, CM[i, h_i(j)] += 1

(figure: item j is hashed by h_1, …, h_d; one counter per row is incremented)
Count-Min Sketch
• Algorithm to estimate a Frequency Query:
– Count(j): f̂_j = min over rows i of CM[i, h_i(j)]

(figure: item j is hashed by h_1, …, h_d; the minimum of the d probed counters is returned)
Collision
• Entry CM[i, h_i(j)] is an estimate of the frequency of item j at row i
– it overcounts when other items collide into the same cell, e.g. when several
of the stream items 3, 5, 5, 8, 5, 2, 5 hash to one cell of the row

• Let f_j be the frequency of item j, and let the random variable X_ij be the
total frequency of all other items j' ≠ j with h_i(j') = h_i(j)
Count-Min Sketch Analysis

(figure: row i, cell h_i(j) among cells 1 … w)

• Estimated frequency of j at row i: CM[i, h_i(j)] = f_j + X_ij ≥ f_j
Count-Min Sketch Analysis
• Let ε be the approximation error, and set w = ⌈e/ε⌉
• The expectation of the other items’ contribution:
E[X_ij] ≤ ‖f‖₁ / w ≤ (ε/e)‖f‖₁
Count-Min Sketch Analysis
• Markov Inequality: Pr[X ≥ a] ≤ E[X]/a for nonnegative X
• Probability that one row’s estimate is far from the true value:
Pr[X_ij ≥ ε‖f‖₁] ≤ E[X_ij]/(ε‖f‖₁) ≤ 1/e
Count-Min Sketch Analysis
• Let δ be the failure probability, and set d = ⌈ln(1/δ)⌉
• The rows are independent, so the probability that the final estimate is far
from the true value: Pr[f̂_j ≥ f_j + ε‖f‖₁] ≤ (1/e)^d ≤ δ
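The parameter choice can be computed directly (illustrative Python; the ε and δ values are arbitrary examples):

```python
import math

# Standard CM sketch sizing: w = ceil(e / eps) bounds the per-row error by
# eps * ||f||_1 with probability >= 1 - 1/e; d = ceil(ln(1 / delta)) rows
# drive the overall failure probability down to delta.
eps, delta = 0.01, 0.001
w = math.ceil(math.e / eps)         # counters per row
d = math.ceil(math.log(1 / delta))  # number of rows
```

Here w = 272 and d = 7, so 272 × 7 counters suffice regardless of the domain size m.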
Count-Min Sketch
• Result
– dynamic data structure CM supporting item frequency queries
– set w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉
– with probability at least 1 − δ:  f_j ≤ f̂_j ≤ f_j + ε‖f‖₁

– sublinear space O((1/ε) log(1/δ)): depends on neither m nor the stream length


– running time O(log(1/δ)) per update and frequency query
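A minimal sketch of the data structure (illustrative Python; building the pairwise-independent hash family on top of Python's built-in hash() is an implementation choice, not from the slides):

```python
import random

class CountMinSketch:
    """Count-Min sketch: d rows of w counters, one pairwise-independent
    hash per row. Standard sizing: w = ceil(e/eps), d = ceil(ln(1/delta))."""
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        self.p = 2_147_483_647  # prime modulus for the hash family
        self.ab = [(rng.randrange(1, self.p), rng.randrange(self.p))
                   for _ in range(d)]

    def _hash(self, i, x):
        # pairwise-independent hash: ((a*x + b) mod p) mod w
        a, b = self.ab[i]
        return ((a * hash(x) + b) % self.p) % self.w

    def update(self, x, c=1):
        # O(d) per update: increment one counter per row
        for i in range(self.d):
            self.table[i][self._hash(i, x)] += c

    def query(self, x):
        # min over rows; never underestimates the true frequency
        return min(self.table[i][self._hash(i, x)] for i in range(self.d))
```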
Approximate Heavy Hitters

TaeHoon Joseph, Kim


Count-Min Sketch (CMS)
• Update takes O(d) time

– updates d values, one per row

• Query takes O(d) time

– returns the minimum of d values

Heavy Hitters Problem
• Input:
– An array of length n with distinct items

• Objective:
– Find all items that occur more than n/k times in the array
• there can be at most k such items

• Parameter: k

Heavy Hitters Problem: Naïve Solution
• The trivial solution is to use an array
1. Store all distinct items and each item’s frequency
2. Find all items that have frequencies > n/k
ε-Heavy Hitters Problem (ε-HH)
• Relax the Heavy Hitters Problem

• Requires sub-linear space


– cannot solve the exact problem
– parameters: k and ε

1. Returns every item that occurs more than n/k times
2. May also return some items that occur more than n/k − εn times
– Count-Min sketch
Naïve Solution using CMS

(figure: each item j from the stream of n items is hashed into the d × w CM sketch)
Naïve Solution using CMS
• Query the frequency of every possible item
– Return items with estimated frequency ≥ n/k

– slow: one query per item in the domain
Better Solution
• Use CMS to store the frequencies

• Use t/k as the threshold after seeing t items

• Use a MinHeap to store potential heavy hitters


– insert new items into the MinHeap when their estimated frequency reaches the threshold
– delete old items from the MinHeap when their estimated frequency falls below the threshold
ε-Heavy Hitters Problem (ε-HH)
1. Returns every item that occurs more than n/k times
2. May also return some items that occur more than n/k − εn times
– choosing the CMS error ε small enough makes the count estimates accurate
enough for this guarantee
Algorithm: Approximate Heavy Hitters
Input: stream of n items, parameter k

For each item x_t (t = 1, 2, …, n):


1. Update the Count-Min Sketch with x_t
2. Compare the estimated frequency f̂(x_t) with t/k
3. if f̂(x_t) ≥ t/k:
   insert or update x_t in the MinHeap
4. remove any value in the MinHeap with estimated frequency < t/k

Return the items in the MinHeap as the Heavy Hitters
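The heap bookkeeping above can be sketched as follows (illustrative Python; exact dictionary counts stand in for the Count-Min sketch so the heap logic, which is the point of this slide, stays easy to follow):

```python
import heapq
from collections import defaultdict

def approximate_heavy_hitters(stream, k):
    """Return candidate items occurring more than len(stream)/k times."""
    counts = defaultdict(int)   # in the real algorithm: a Count-Min sketch
    heap = []                   # entries are (estimated_count, item)
    in_heap = set()
    for t, x in enumerate(stream, start=1):
        counts[x] += 1                       # sketch update
        if counts[x] >= t / k and x not in in_heap:
            heapq.heappush(heap, (counts[x], x))
            in_heap.add(x)
        # evict items whose (possibly stale) count fell below the threshold
        while heap and heap[0][0] < t / k:
            _, y = heapq.heappop(heap)
            if counts[y] >= t / k:           # still qualifies: refresh count
                heapq.heappush(heap, (counts[y], y))
            else:
                in_heap.discard(y)
        # (heap counts may be stale; the eviction loop refreshes them lazily)
    return {x for _, x in heap}
```

Because heap entries carry the count at insertion time, the eviction loop refreshes any entry whose stale count drops under the rising threshold t/k.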


EXAMPLES

(worked-example slides, shown as figures in the deck: a stream beginning
4, 2, 6, 9, 3, 4, … is fed into the d × w CM sketch. After each arrival the
item’s counters are incremented and its estimated frequency is compared with
the current threshold; the Min-Heap of {count : item} candidates, e.g. {1:4},
{1:2}, {1:6}, {1:9}, grows to entries such as {2:4} and later {16:4}, {20:9},
{23:6}, while items whose estimates fall below the rising threshold are
evicted.)
Analysis
• Because the stream length n is unknown in advance, possible heavy hitters are
computed and stored as every new item comes in
• Maintaining the heap requires extra O(log k) time per item
AMS Sketch : Estimate
Second Moment
Dissanayaka Mudiyanselage Emil Manupa Karunaratne
The Second Moment
• Stream: item j appears f_j times
• The Second Moment: F₂ = Σ_j f_j²

• The trivial solution would be: maintain a histogram of size n and return the sum of
squares
• It is not feasible to maintain that large an array, therefore we intend to find an
approximation algorithm that achieves sub-linear space complexity with bounded
errors
• The algorithm will give an estimate within ε relative error with δ failure probability.
(Two parameters)
The Method

(figure: item j is hashed to one bucket per row; d rows, each update adds ±1)

• j is the next item in the stream.

• 2-wise independent hash functions h_1, …, h_d find the bucket for each row

• After finding the bucket, 4-wise independent hash functions g_1, …, g_d decide
inc/dec: g_i(j) ∈ {+1, −1}

• In summary: for each row i, CM[i, h_i(j)] += g_i(j)
The Method

• Calculate each row estimate: R_i = Σ_b CM[i, b]²

• Final estimate: the median of R_1, …, R_d
• Choose w = O(1/ε²) and d = O(log(1/δ)); by doing so it will give an estimate
with ε relative error and δ failure probability
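The method above can be sketched as follows (illustrative Python, not the lecture's code; the polynomial hash families are the standard constructions described in the later "Hash functions" slides):

```python
import random
import statistics

class AMSSketch:
    """Fast-AMS (tug-of-war) sketch for the second moment F2:
    d rows x w buckets; h_k picks a bucket, g_k gives a +/-1 sign."""
    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.p = 2_147_483_647  # prime field for the hash families
        self.table = [[0] * w for _ in range(d)]
        # per row: 2 coefficients for the bucket hash h, 4 for the sign hash g
        self.coef = [[rng.randrange(1, self.p) for _ in range(6)]
                     for _ in range(d)]

    def _bucket(self, k, j):
        # pairwise-independent h_k: ((a*j + b) mod p) mod w
        a, b = self.coef[k][0], self.coef[k][1]
        return ((a * j + b) % self.p) % self.w

    def _sign(self, k, j):
        # 4-wise independent g_k via a degree-3 polynomial mod p, mapped to +/-1
        c = self.coef[k]
        v = (c[2] * j**3 + c[3] * j**2 + c[4] * j + c[5]) % self.p
        return 1 if v % 2 == 0 else -1

    def update(self, j, c=1):
        # for each row k: CM[k, h_k(j)] += c * g_k(j)
        for k in range(self.d):
            self.table[k][self._bucket(k, j)] += c * self._sign(k, j)

    def estimate_f2(self):
        # row estimate: sum of squared buckets; final estimate: median of rows
        return statistics.median(sum(v * v for v in row) for row in self.table)
```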
Why should this method give F2?

(figure: d = 8 log(1/δ) rows; row k adds g_k(j) for item j)

• In the kth row, each bucket b holds Σ over items j with h_k(j) = b of g_k(j) f_j

• Estimate of F2 from the kth row: R_k = sum of squared buckets
• Expanding each row gives: R_k = Σ_j f_j² + Σ_{i≠j colliding} g_k(i) g_k(j) f_i f_j
• First part: exactly F2
• Second part: g(i)g(j) can be +1 or -1 with equal probability, therefore
the expectation is 0.
What guarantee can we give about the accuracy?
• The variance of R_k, a row estimate, is caused by hashing collisions.
• Given the independence of the hash functions, we can safely
state the variance is bounded by Var[R_k] ≤ 2F₂²/w
• Using the Chebyshev Inequality:
Pr[|R_k − F₂| ≥ εF₂] ≤ Var[R_k]/(ε²F₂²) ≤ 2/(wε²)
• Let us assign w = 8/ε², so each row estimate fails with probability at most 1/4

• Still, this failure probability is only linear in 1/w: driving it down to δ
this way would cost too much space


What guarantee can we give about the accuracy?
• We had d hash functions, producing the estimates R1, R2, …, Rd.
• The median being wrong means half of the estimates are wrong
• These are d independent estimates, like coin tosses, which have an
exponentially decaying probability of half landing the same (wrong) way.
• They obey stronger bounds, Chernoff Bounds:
with d = 8 log(1/δ) rows, Pr[the median is off by more than εF₂] ≤ δ
Space and Time Complexity
• E.g., in order to achieve δ = e^−10 failure probability, only 8 × 10
= 80 rows are required
• Space complexity is O((1/ε²) log(1/δ)).
• Time complexity will be explained later along with the application
AMS Sketch and Applications

Sapumal Ahangama
Hash functions
• h_k maps the input domain uniformly onto w buckets
• h_k should come from a pairwise independent family of hash functions, to cancel
out product terms
– Ex: the family h(x) = ((a·x + b) mod p) mod w
– for a and b chosen uniformly from the prime field Z_p
Hash functions
• g_k maps elements from the domain uniformly onto {+1, −1}
• g_k should be four-wise independent

• Ex: the family of degree-3 polynomials
g(x) = 2·(((a·x³ + b·x² + c·x + d) mod p) mod 2) − 1

– for a, b, c, d chosen uniformly from the prime field Z_p.
Hash functions
• These hash functions can be computed very quickly, faster even than
more familiar (cryptographic) hash functions
• For scenarios which require very high throughput, efficient
implementations are available for hash functions,
– Based on optimizations for particular values of p, and partial precomputations

– Ref: M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with


applications to second moment estimation. In ACM-SIAM Symposium on
Discrete Algorithms, 2004
Time complexity - Update
• The sketch is initialized by picking the hash functions to use,
and initializing the array of counters to all zeros
• For each update operation, the item is mapped to an entry in
each row based on the hash functions h_k, multiplied by the
corresponding value of g_k
• Processing each update therefore takes O(d) time
– since each hash function evaluation takes constant time.
Time complexity - Query
• Found by taking the sum of the squares of each row of the
sketch in turn, and taking the median of these sums.
– That is, for each row k, compute R_k = Σ_b CM[k, b]²
– Take the median of the d such estimates

• Hence the query time is linear in the size of the sketch, O(d·w)
Applications - Inner product
• The AMS sketch can be used to estimate the inner product
between a pair of vectors
• Given two frequency distributions f and f′

• The AMS sketch based estimator is an unbiased estimator for the


inner product f · f′ of the vectors
Inner Product
• Two sketches A and B
• Formed with the same parameters and using the same hash
functions (same h_k and g_k)
• The row estimate is the inner product of the rows: R_k = Σ_b A[k, b]·B[k, b]
Inner Product
• Expanding: R_k = Σ_j f_j f′_j + cross-terms from pairs i ≠ j with h_k(i) = h_k(j)

• This shows that the estimate gives f · f′ with additional cross-terms due


to collisions of items under h_k
• The expectation of these cross terms is zero
– Over the choice of the hash functions, as the g function is equally
likely to add as to subtract any given term.
Inner Product – Join size estimation
• Inner product has a natural interpretation, as the size of the
equi-join between two relations…
• In SQL:
SELECT COUNT(*) FROM D, D’ WHERE D.id = D’.id
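Using the same hash functions for both sketches, as the slides require, the inner-product estimator can be sketched as (illustrative Python; the example frequency vectors are arbitrary):

```python
import random
import statistics

def make_row(w, seed):
    """One AMS row: a bucket hash h and a +/-1 sign hash g (illustrative)."""
    rng = random.Random(seed)
    p = 2_147_483_647
    a, b = rng.randrange(1, p), rng.randrange(p)
    c = [rng.randrange(1, p) for _ in range(4)]
    bucket = lambda j: ((a * j + b) % p) % w
    sign = lambda j: 1 if (c[0]*j**3 + c[1]*j**2 + c[2]*j + c[3]) % p % 2 == 0 else -1
    return bucket, sign

def sketch(freqs, rows, w):
    """Sketch a frequency vector using the given (shared) hash rows."""
    out = []
    for bucket, sign in rows:
        row = [0] * w
        for j, f in freqs.items():
            row[bucket(j)] += sign(j) * f
        out.append(row)
    return out

w = 1024
rows = [make_row(w, s) for s in range(5)]  # same hashes for both sketches
f1 = {1: 3, 2: 4, 5: 2}                    # frequency distribution D
f2 = {2: 1, 5: 6, 9: 7}                    # frequency distribution D'
A, B = sketch(f1, rows, w), sketch(f2, rows, w)
# row estimate = dot product of matching rows; final estimate = median
est = statistics.median(sum(x * y for x, y in zip(ra, rb))
                        for ra, rb in zip(A, B))
```

With no collisions a row's dot product is exactly the true inner product (here 4·1 + 2·6 = 16), since the g² terms cancel to 1.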
Example
UPDATE(23, 1)

The sketch starts as a d = 3 by w = 8 array of zeros. Item 23 is mapped to one
column per row by h1, h2, h3, and g gives the sign of the increment:

     1   2   3   4   5   6   7   8
 1   0   0  -1   0   0   0   0   0
 2  -1   0   0   0   0   0   0   0
 3   0   0   0   0   0   0  +1   0

Example
UPDATE(99, 2)

Item 99 is then added with increment 2 (note that row 2 hashes 99 to the same
column as 23, so that counter moves from -1 to -3):

     1   2   3   4   5   6   7   8
 1   0   0  -1   0  +2   0   0   0
 2  -3   0   0   0   0   0   0   0
 3   0   0  +2   0   0   0  +1   0

d = 3, w = 8
