DGIM

The document discusses streaming algorithms, focusing on the DGIM algorithm for counting bits in a binary stream and the Bloom Filter for approximate set membership. It highlights the challenges of processing data in real-time with limited memory and presents methods for estimating answers to queries without storing the entire data stream. Additionally, it covers the Count-Min Sketch algorithm for frequency queries, emphasizing its applications in real systems and the need for sublinear space solutions.


Streaming Algorithms

1
The stream model
• Data sequentially enters at a rapid rate from one or more inputs

• We cannot store the entire stream

• Processing in real-time

• Limited memory (usually sublinear in the size of the stream)

• Goal: compute a function of the stream, e.g., median, number of
distinct elements, longest increasing subsequence

• An approximate answer is usually acceptable

2
Counting bits with
DGIM algorithm

3
Sliding windows
• A useful model: queries are about a window of length N

• The N most recent elements received (or last N time units)

• Interesting case: N is still so large that it cannot be stored

• Or, there are so many streams that windows for all cannot be stored

4
Problem description
• Problem
• Given a stream of 0’s and 1’s
• Answer queries of the form “how many 1’s in the last k bits?” where k ≤ N
• Obvious solution
• Store the most recent N bits (i.e., window size = N)
• When a new bit arrives, discard the (N+1)st most recent bit
• Real Problem
• Slow: we need to scan k bits to answer a count query
• What if we cannot afford to store N bits?
• Estimate with an approximate answer

5
Datar-Gionis-Indyk-Motwani Algorithm (DGIM)
Overview

• Approximate answer

• Uses O(log² N) bits of memory

• Performance guarantee: error no more than 50%

• Possible to decrease the error to any fraction ε > 0, still with O(log² N) memory

• Possible to generalize to a stream of positive integers

6
Main idea of the algorithm

Represent the window as a set of exponentially growing non-overlapping buckets

7
Timestamps
• Each bit in the stream has a timestamp - the position in the stream from the
beginning.

• Record timestamps modulo N (the window size), so each needs O(log N) bits

• Store the most recent timestamp to identify the position of any other bit in the
window

8
Buckets
• Each bucket has two components:

• The timestamp of its most recent end; needs O(log N) bits

• The size of the bucket, i.e., the number of 1's in it

• The size is always 2^j, so to store j we need only O(log log N) bits

• Each bucket therefore needs O(log N) bits

9
RULES FOR FORMING THE BUCKETS:
1. The right end of a bucket is always a position with a 1 (a 0 at the right end is
never included). E.g., 1001011 can be covered by a bucket of size 4: it contains four
1's and ends with a 1 at its right end.

2. Every bucket contains at least one 1; no bucket is formed from 0's alone.

3. All bucket sizes are powers of 2.

4. Bucket sizes cannot decrease as we move to the left (they are non-decreasing
going back in time).

10
Representing the stream by buckets
• The right end of a bucket is always a position with a 1.
• Every position with a 1 is in some bucket.
• Buckets do not overlap.
• There are one or two buckets of any given size, up to some maximum size.
• All sizes must be a power of 2.
• Buckets cannot decrease in size as we move to the left (back in time).

11
Updating buckets when a new bit arrives
• Drop the oldest bucket if it no longer overlaps the window

• If the current bit is zero, no changes are needed

• If the current bit is one

• Create a new bucket with it. Size = 1, timestamp = current time modulo N.

• If there are 3 buckets of size 1, merge two oldest into one of size 2.

• If there are 3 buckets of size 2, merge two oldest into one of size 4.

• ...

12
Example of updating process

13
Query Answering
How many ones are in the most recent k bits?

• Find all buckets overlapping the last k bits

• Sum the sizes of all but the oldest one

• Add half of the size of the oldest one

Example: for buckets of sizes 1, 1, 2, 4, 4, 8, 8 all overlapping the window,
Ans = 1 + 1 + 2 + 4 + 4 + 8 + 8/2 = 24

14
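The update and query rules above can be sketched in code. This is a minimal illustration, not the authors' implementation (the class and method names are our own): bucket bookkeeping follows the merge rule of never keeping three buckets of one size, and the query sums all overlapping buckets but halves the oldest.

```python
# Illustrative DGIM sketch (class and method names are ours):
# buckets are (timestamp-of-most-recent-1, size) pairs, newest first.
class DGIM:
    def __init__(self, window):
        self.window = window    # N
        self.time = 0           # current position in the stream
        self.buckets = []       # newest-first list of (timestamp, size)

    def update(self, bit):
        self.time += 1
        # drop the oldest bucket once its most recent 1 leaves the window
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit == 0:
            return
        self.buckets.insert(0, (self.time, 1))
        # never keep three buckets of one size: merge the two oldest of them
        i = 0
        while i + 2 < len(self.buckets):
            if self.buckets[i][1] == self.buckets[i + 2][1]:
                ts = self.buckets[i + 1][0]      # newer end of the merged pair
                size = 2 * self.buckets[i + 1][1]
                self.buckets[i + 1:i + 3] = [(ts, size)]
            else:
                i += 1

    def count(self, k):
        """Estimate the number of 1s among the last k bits (k <= window)."""
        sizes = [s for ts, s in self.buckets if ts > self.time - k]
        if not sizes:
            return 0
        oldest = sizes[-1]       # newest-first, so the last entry is the oldest
        return sum(sizes) - oldest + oldest // 2
```

On an all-ones stream the estimate stays within the 50% guarantee of the true count.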
Memory requirements

Each bucket needs O(log N) bits, and there are at most one or two buckets of each
size up to the largest, i.e., O(log N) buckets in total, so the whole window is
summarized in O(log² N) bits.
15
Performance guarantee
• Suppose the oldest overlapping bucket has size 2^r

• By taking half of it, the maximum error is 2^(r-1)

• There is at least one bucket of every size smaller than 2^r, so the true count is at
least 1 + 2 + 4 + … + 2^(r-1) = 2^r − 1

• The right end of the oldest bucket is always a 1 inside the window, adding one more

• Hence the error is at most 2^(r-1) / 2^r = 50%

16
References

J. Leskovec, A. Rajaraman, J. Ullman. "Mining of Massive Datasets".
Cambridge University Press

18
Bloom Filter

Presented by-
Naheed Anjum Arafat

19
Motivation:
The "Set Membership" Problem

• x: an element
• S: a finite set of elements
• Input: x, S
• Output: True (if x is in S), False (if x is not in S)

Streaming algorithm constraints:
• limited space per item
• limited processing time per item
• an approximate answer based on a summary/sketch of the data stream kept in memory

Exact solution: binary search on a sorted array of size |S|. Runtime complexity: O(log |S|)

20
Bloom Filter

• Consists of:
• a vector of n Boolean values, initially all set to False (space: O(n))
• k independent and uniform hash functions h_0, h_1, …, h_(k-1), each outputting
a value in the range {0, 1, …, n−1}

Example: n = 10; positions 0–9 all start as False
21
Bloom Filter
• For each element sϵS, the Boolean value at positions
ℎ0 𝑠 , ℎ1 𝑠 , … , ℎ𝑘−1 𝑠 are set true.
• Complexity of Insertion:- O(k)

𝑠1
ℎ0 𝑠1 = 1 ℎ2 𝑠1 = 6
ℎ1 𝑠1 = 4

F TF F F FT F TF F F F
0 1 2 3 4 5 6 7 8 9

k=3
22
Bloom Filter
• For each element s ∈ S, the Boolean values at positions h_0(s), h_1(s), …, h_(k-1)(s)
are set to True
• Note: a particular Boolean value may be set to True several times
• Example (k = 3): inserting s₂ with h_0(s₂) = 4, h_1(s₂) = 7, h_2(s₂) = 9 sets
positions 7 and 9 to True; position 4 was already True from s₁
23
Algorithm to Approximate Set Membership Query
Input: x (which may or may not be an element of S)
Output: Boolean
Runtime complexity: O(k)

For all i ∈ {0, 1, …, k−1}:
    if the bit at position h_i(x) is False:
        return False
return True

Example (k = 3): querying x = s₁ finds all of its positions True and returns True;
querying a never-inserted x = s₃ returns False unless all k of its positions happen
to have been set by other elements.
24
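The insert and query operations above can be sketched as a small class. This is a minimal illustration (all names are our own); for brevity it derives the k hash positions from one SHA-256 digest via the common double-hashing trick rather than k truly independent functions.

```python
import hashlib

# Illustrative Bloom filter: an n-bit array plus k derived hash positions.
class BloomFilter:
    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [False] * n          # vector of n Boolean values

    def _positions(self, item):
        # double hashing: positions (h1 + i*h2) mod n for i = 0..k-1
        h = hashlib.sha256(str(item).encode()).digest()
        h1 = int.from_bytes(h[:8], "big")
        h2 = int.from_bytes(h[8:16], "big") | 1
        return [(h1 + i * h2) % self.n for i in range(self.k)]

    def add(self, item):                 # insertion: O(k)
        for p in self._positions(item):
            self.bits[p] = True

    def might_contain(self, item):       # query: O(k); no false negatives
        return all(self.bits[p] for p in self._positions(item))
```

A query answers False with certainty, but True only means "probably in S".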
Algorithm to Approximate Set Membership Query

False Positive!!

𝑠1 ℎ2 𝑠1 = 6 𝑠2
ℎ0 𝑠2 = 4
ℎ0 𝑠1 = 1 ℎ1 𝑠1 = 4 ℎ2 𝑠2 = 9
ℎ1 𝑠2 = 7

F T F F T F T T F T
0 1 2 3 4 5 6 7 8 9

ℎ1 𝑥 = 6
ℎ2 𝑥 = 1 ℎ0 𝑥 = 9
𝑥
k=3
25
Error Types

• False Negative – Answering “is not there” on an element which “is there”
• Never happens for Bloom Filter

• False Positive – Answering “is there” for an element which “is not there”
• Might happen. How likely?

26
Probability of false positives
n = size of table
m = number of items
k = number of hash functions

Consider a particular bit j, 0 ≤ j ≤ n−1.
Probability that h_i(x) does not set bit j after hashing one item:
P[h_i(x) ≠ j] = 1 − 1/n
Probability that h_i(x) does not set bit j after hashing m items:
P[∀x ∈ {s₁, s₂, …, s_m}: h_i(x) ≠ j] = (1 − 1/n)^m
27
Probability of false positives
n = size of table
m = number of items
k = number of hash functions

Probability that none of the k hash functions sets bit j after hashing m items:
P[∀x ∈ {s₁, …, s_m}, ∀i ∈ {1, …, k}: h_i(x) ≠ j] = (1 − 1/n)^(km)

We know that (1 − 1/n)^n ≈ 1/e = e^(−1), therefore
(1 − 1/n)^(km) = ((1 − 1/n)^n)^(km/n) ≈ (e^(−1))^(km/n) = e^(−km/n)
Probability of false positives
n = size of table
m = number of items
k = number of hash functions

Probability that bit j is not set: P[bit j = False] = e^(−km/n)

The probability of having all k bits of a new element already set — the approximate
probability of a false positive — is (1 − e^(−km/n))^k

For fixed m and n, which value of k minimizes this bound? k_opt = ln 2 · (n/m)

With k = k_opt, the probability of a false positive is (1/2)^(k_opt) = (0.6185)^(n/m),
where n/m is the number of bits per item.
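These bounds are easy to evaluate numerically; a small helper sketch (the function names are our own):

```python
import math

# Evaluate the approximate false-positive bound and the optimal k.
def fp_rate(n, m, k):
    """Approximate false-positive probability: (1 - e^{-km/n})^k."""
    return (1.0 - math.exp(-k * m / n)) ** k

def k_opt(n, m):
    """Number of hash functions minimizing the bound: ln 2 * (n/m)."""
    return math.log(2) * n / m
```

For example, with 10 bits per item (n/m = 10), k_opt ≈ 6.9 and the bound is roughly 0.6185^10 ≈ 0.8%.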
Bloom Filters: cons

• Small (but nonzero) false positive probability

• Cannot handle deletions
• The size of the bit vector has to be set a priori in order to maintain a
predetermined FP rate. Resolved in "Scalable Bloom Filters":
Almeida, Paulo; Baquero, Carlos; Preguiça, Nuno; Hutchison, David (2007), "Scalable
Bloom Filters", Information Processing Letters 101 (6): 255–261

30
References

• https://en.wikipedia.org/wiki/Bloom_filter
• Graham Cormode, Sketch Techniques for Approximate Query
Processing, AT&T Research
• Michael Mitzenmacher, Compressed Bloom Filters, Harvard
University, Cambridge

31
Count-Min Sketch

Erick Purwanto
A0050717L
Motivation Count-Min Sketch
• Implemented in real systems
– AT&T: in network switches to analyze network traffic
using limited memory
– Google: implemented on top of the MapReduce
parallel processing infrastructure

• Simple, and used to solve other problems
– Heavy Hitters (by Joseph)
– Second Moment F₂, AMS Sketch (by Manupa)
– Inner Product, Self Join (by Sapumal)
Frequency Query
• Given a stream seen as a data vector x of length n, with items x_i ∈ [1, m],
and an update (increment) operation,
– we want to know, at each point in time, the frequency f_j of each item j
– assume every frequency f_j ≥ 0

• Trivial with a count array indexed by [1, m], but
– we want sublinear space
– and accept answers that are probabilistically approximately correct
Count-Min Sketch
• Assumption:
– a family H of pairwise-independent hash functions
– sample d hash functions h_i ← H, each h_i : [1, m] → [1, w]

• Use: the d independent hash functions and an integer array CM of d rows × w counters

Count-Min Sketch
• Algorithm to Update:
– Inc(j): for each row i, CM[i, h_i(j)] += 1
– i.e., each of the d hash functions h₁, …, h_d selects one counter in its row to increment
Count-Min Sketch
• Algorithm to estimate a Frequency Query:
– Count(j): f̃_j = min_i CM[i, h_i(j)]
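The Inc and Count operations can be sketched as follows. This is a minimal illustration (class and parameter names are ours), using the pairwise-independent family h(x) = ((a·x + b) mod p) mod w described later in these slides:

```python
import random

# Illustrative Count-Min sketch: d rows of w counters.
class CountMinSketch:
    P = (1 << 61) - 1    # a Mersenne prime larger than the item domain

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
        # one (a, b) pair per row: h_i(x) = ((a*x + b) mod p) mod w
        self.hashes = [(rng.randrange(1, self.P), rng.randrange(self.P))
                       for _ in range(d)]

    def _h(self, i, x):
        a, b = self.hashes[i]
        return ((a * x + b) % self.P) % self.w

    def inc(self, j, c=1):               # update: O(d)
        for i in range(self.d):
            self.table[i][self._h(i, j)] += c

    def count(self, j):                  # query: minimum over rows, O(d)
        return min(self.table[i][self._h(i, j)] for i in range(self.d))
```

Because collisions only add mass, Count(j) never underestimates f_j.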
Collision
• The entry CM[i, h_i(j)] is an estimate of the frequency of item j at row i
– collisions inflate it: for example, if h₁(5) = h₁(2) = 7, the counts of
items 5 and 2 share bucket 7 of row 1

• Let f_j be the frequency of j, and define the random variable X_{i,j} as the
total frequency of all items k ≠ j with h_i(k) = h_i(j)
Count-Min Sketch Analysis

• Estimate of the frequency of j at row i:

f̃_{i,j} = CM[i, h_i(j)]
       = f_j + Σ_{k≠j : h_i(k)=h_i(j)} f_k
       = f_j + X_{i,j}
Count-Min Sketch Analysis
• Let ε be the approximation error, and set w = e/ε
• The expectation of the other items' contribution:

E[X_{i,j}] = Σ_{k≠j} f_k · Pr[h_i(k) = h_i(j)]
           ≤ Pr[h_i(k) = h_i(j)] · Σ_k f_k
           = (1/w) · F₁
           = (ε/e) · F₁
Count-Min Sketch Analysis
• Markov's inequality: Pr[X ≥ k · E[X]] ≤ 1/k

• Probability that a row estimate is more than ε·F₁ above the true value:

Pr[f̃_{i,j} > f_j + ε·F₁] = Pr[X_{i,j} > ε·F₁]
                          = Pr[X_{i,j} > e · E[X_{i,j}]]
                          ≤ 1/e
Count-Min Sketch Analysis
• Let δ be the failure probability, and set d = ln(1/δ)

• Probability that the final estimate is far from the true value (all d independent
rows must fail):

Pr[f̃_j > f_j + ε·F₁] = Pr[∀i : f̃_{i,j} > f_j + ε·F₁]
                      = (Pr[f̃_{i,j} > f_j + ε·F₁])^d
                      ≤ (1/e)^(ln(1/δ))
                      = δ
Count-Min Sketch
• Result
– a dynamic data structure CM supporting item frequency queries
– set w = e/ε and d = ln(1/δ)
– with probability at least 1 − δ: f̃_j ≤ f_j + ε · Σ_k f_k
– sublinear space, depending on neither n nor m
– running time O(d) per update and O(d) per frequency query
Approximate Heavy Hitters

TaeHoon Joseph, Kim


Count-Min Sketch (CMS)
• Inc(j) takes O(d) time
– O(1) × d: update one counter in each of the d rows

• Count(j) takes O(d) time
– O(1) × d: return the minimum of the d values
Heavy Hitters Problem
• Input:
– an array of length n with m distinct items

• Objective:
– find all items that occur more than n/k times in the array
• there can be at most k such items

• Parameter: k
Heavy Hitters Problem: Naïve Solution
• The trivial solution is to use an O(m) array:
1. store every item and its frequency
2. report all items whose frequency is ≥ n/k
𝜖-Heavy Hitters Problem (𝜖-𝐻𝐻)
• Relax the Heavy Hitters Problem

• Require sub-linear space
– so we cannot solve the exact problem
– parameters: k and ε
𝜖-Heavy Hitters Problem (𝜖-𝐻𝐻)
1. Returns every item that occurs more than n/k times
2. May also return some items that occur more than n/k − ε·n times
– the Count-Min sketch guarantee: f̃_j ≤ f_j + ε · Σ_k f_k
Naïve Solution using CMS
• Query the frequency of all m items through the sketch
– return the items with Count(j) ≥ n/k

• O(md) time
– too slow when m is large
Better Solution
• Use a CMS to store the frequencies

• Use a baseline b as the threshold after the i-th item:
– b = i/k

• Use a MinHeap to store the potential heavy hitters seen so far:
– insert new items into the MinHeap when their estimated frequency is ≥ b
– delete old items from the MinHeap when their frequency drops below b
(the threshold only grows)
𝜖-Heavy Hitters Problem (𝜖-𝐻𝐻)
1. Returns every item that occurs more than n/k times
2. May return some items that occur more than n/k − ε·n times
– with ε = 1/(2k): Count(x) ∈ [f_x, f_x + n/(2k)]
– heap size ≤ 2k
Algorithm Approximate Heavy Hitters
Input: stream x, parameter k

For each item j ∈ x:
1. update the Count-Min Sketch
2. compare the estimated frequency of j with b
3. if count ≥ b: insert or update j in the MinHeap
4. remove any value in the MinHeap with frequency < b

Return the contents of the MinHeap as the heavy hitters
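The loop above can be sketched end-to-end. This is an illustrative implementation (all names, and the default w and d, are our own choices): a Count-Min sketch tracks counts, and a lazily pruned min-heap holds the candidates whose estimated count is at least b = i/k.

```python
import heapq
import random

# Illustrative CMS + min-heap approximate heavy hitters.
def heavy_hitters(stream, k, w=272, d=5, seed=0):
    p = (1 << 61) - 1                        # a Mersenne prime for hashing
    rng = random.Random(seed)
    hashes = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(d)]
    cm = [[0] * w for _ in range(d)]         # the Count-Min sketch

    def est(j):                              # Count(j): minimum over the d rows
        return min(cm[r][((a * j + b) % p) % w] for r, (a, b) in enumerate(hashes))

    heap, latest = [], {}                    # lazy min-heap + live estimate per candidate
    for i, j in enumerate(stream, start=1):
        for r, (a, b) in enumerate(hashes):  # 1. update the sketch, O(d)
            cm[r][((a * j + b) % p) % w] += 1
        threshold = i / k                    # 2. baseline b = i/k
        c = est(j)
        if c >= threshold:                   # 3. insert or update j among candidates
            latest[j] = c
            heapq.heappush(heap, (c, j))     # stale copies are pruned lazily below
        while heap and heap[0][0] < threshold:   # 4. evict below-threshold entries
            c_old, x = heapq.heappop(heap)
            if latest.get(x) == c_old:       # only drop x if this was its live entry
                del latest[x]
    return set(latest)                       # the candidate heavy hitters
```

On a stream of 100 items where item 1 appears 40 times, heavy_hitters(stream, 4) should report item 1 (threshold n/k = 25).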


Worked examples (k = 5, threshold b = i/k after i items; heap entries written {count:item}):

• i = 1: the stream so far is 4; b = 1/5; item 4's count 1 ≥ b, so the heap holds {1:4}.
• i = 5: the stream so far is 4 2 6 9 3; b = 1; all five items have count 1 ≥ b, so the
heap holds {1:2}, {1:3}, {1:4}, {1:6}, {1:9}.
• i = 6: a second 4 arrives; its sketch counters are incremented and its heap entry
becomes {2:4}; b = 6/5, so the remaining count-1 entries fall below the threshold and
are evicted, leaving {2:4}.
• i = 79: much later, b = 79/5 = 15.8; the heap holds {16:2}, {16:4}, {20:9}, {23:6}.
• i = 80: a 1 arrives; b = 80/5 = 16; the count-16 entries now sit exactly at the threshold.
• i = 81: a 9 arrives and its heap entry is updated to {21:9}; b = 81/5 = 16.2, so
{16:2} and {16:4} drop below the threshold and are evicted, leaving {21:9} and {23:6}.
Analysis
• Because n is unknown in advance, the set of possible heavy hitters is recalculated
and stored as every new item arrives
• Maintaining the heap requires extra O(log k) = O(log 1/ε) time per item
AMS Sketch: Estimating the Second Moment
Dissanayaka Mudiyanselage Emil Manupa Karunaratne

The Second Moment
• Stream: items i, each with frequency f_i
• The second moment: F₂ = Σ_i f_i²

• The trivial solution would be: maintain a histogram of size n and return the
sum of squares
• It is not feasible to maintain such a large array, so we look for an
approximation algorithm with sub-linear space complexity and bounded error
• The algorithm gives an estimate within ε relative error with δ failure
probability (two parameters)
The Method
• j is the next item in the stream
• d 2-wise independent hash functions h_k pick the bucket for j in each of the d rows
• then d 4-wise independent hash functions g_k : items → {−1, +1} decide whether to
increment or decrement that bucket
• In summary: for each row k, C[k, h_k(j)] += g_k(j)
The Method
• Calculate a row estimate R_k = Σ_i C[k, i]²

• Output the median of the d row estimates

• Choose w = 4/ε² and d = 8·log(1/δ); doing so gives an estimate
with ε relative error and δ failure probability
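The update and estimation steps can be sketched as below; an illustrative implementation (class and parameter names are ours), with h drawn from a pairwise-independent linear family and g from a four-wise-independent cubic family as described in the hash-function slides:

```python
import random
from statistics import median

# Illustrative AMS ("tug-of-war") F2 sketch: d rows of w signed counters.
class AMSSketch:
    P = (1 << 61) - 1    # a Mersenne prime larger than the item domain

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.C = [[0] * w for _ in range(d)]
        # bucket hash h_i: pairwise independent, (a*x + b) mod p mod w
        self.h = [(rng.randrange(1, self.P), rng.randrange(self.P))
                  for _ in range(d)]
        # sign hash g_i: 4 coefficients of a cubic polynomial per row
        self.g = [[rng.randrange(self.P) for _ in range(4)] for _ in range(d)]

    def _bucket(self, i, x):
        a, b = self.h[i]
        return ((a * x + b) % self.P) % self.w

    def _sign(self, i, x):
        a, b, c, e = self.g[i]
        v = (((a * x + b) * x + c) * x + e) % self.P   # a*x^3 + b*x^2 + c*x + e
        return 2 * (v % 2) - 1                          # maps to {-1, +1}

    def update(self, j, c=1):            # O(d) per item
        for i in range(self.d):
            self.C[i][self._bucket(i, j)] += self._sign(i, j) * c

    def estimate_f2(self):               # median of row sums of squares
        return median(sum(v * v for v in row) for row in self.C)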
Why should this method give F₂?
• For the k-th row (of d = 8·log(1/δ)), each bucket holds Σ_{i in bucket} g_k(i)·f_i
• The estimate of F₂ from the k-th row: R_k = Σ_buckets (Σ_i g_k(i)·f_i)²
• Expanding, each row gives: Σ_i g_k(i)²·f_i² + Σ_{i≠j} g_k(i)·g_k(j)·f_i·f_j
• First part: g_k(i)² = 1, so it sums to exactly F₂
• Second part: g(i)·g(j) can be +1 or −1 with equal probability, therefore
the expectation is 0
What guarantee can we give about the accuracy?
• The variance of R_k, a row estimate, is caused by hashing collisions
• Given the independent nature of the hash functions, we can safely
state the variance is bounded by F₂²/w
• Using Chebyshev's inequality: Pr[|R_k − F₂| ≥ ε·F₂] ≤ (F₂²/w)/(ε²·F₂²) = 1/(w·ε²)
• Assigning w = 4/ε² makes this failure probability at most 1/4
• Still, the failure probability is only linear in 1/w
What guarantee can we give about the accuracy?
• We have d hash functions, producing the row estimates R₁, R₂, …, R_d
• The median being wrong means at least half of the estimates are wrong
• These are d independent estimates, like coin tosses, so the probability that
half of them deviate decays exponentially
• They have stronger bounds, Chernoff bounds:
– μ = (#estimates) × (success prob.) = (3/4)·d
– the median fails only when the number of good estimates drops to d/2,
i.e., d/4 below the mean
– with d = 8·log(1/δ), the Chernoff bound makes this probability at most δ
Space and Time Complexity
• E.g., to achieve a failure probability of e^(−10), only d = 8 × 10 = 80 rows
are required
• Space complexity is O(log(1/δ)) rows of O(1/ε²) counters
• The time complexity will be explained later, along with the applications
AMS Sketch and Applications

Sapumal Ahangama
Hash functions
• h_k maps the input domain uniformly onto the buckets {1, 2, …, w}
• h_k should be drawn from a pairwise independent family of hash functions,
to cancel out product terms
– Ex: the family h(x) = ((a·x + b) mod p) mod w
– for a and b chosen from the prime field of p, with a ≠ 0
Hash functions
• g_k maps elements of the domain uniformly onto {−1, +1}
• g_k should be four-wise independent

• Ex: the family of cubic polynomials a·x³ + b·x² + c·x + d mod p, mapped to a sign:

g(x) = 2·(((a·x³ + b·x² + c·x + d) mod p) mod 2) − 1
– for a, b, c, d chosen uniformly from the prime field of p
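The two families can be written out directly; p and the way coefficients are drawn below are our own illustrative choices:

```python
import random

P = (1 << 61) - 1    # a prime larger than the input domain

def make_h(w, rng):
    """Pairwise independent bucket hash: h(x) = ((a*x + b) mod p) mod w."""
    a, b = rng.randrange(1, P), rng.randrange(P)   # a != 0
    return lambda x: ((a * x + b) % P) % w

def make_g(rng):
    """Four-wise independent sign: g(x) = 2*(((a*x^3+b*x^2+c*x+d) mod p) mod 2) - 1."""
    a, b, c, d = (rng.randrange(P) for _ in range(4))
    return lambda x: 2 * ((a * x**3 + b * x**2 + c * x + d) % P % 2) - 1
```

make_h always lands in {0, …, w−1} and make_g always returns ±1.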
Hash functions
• These hash functions can be computed very quickly, faster even than
more familiar (cryptographic) hash functions
• For scenarios which require very high throughput, efficient
implementations are available for hash functions,
– Based on optimizations for particular values of p, and partial
precomputations

– Ref: M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with


applications to second moment estimation. In ACM-SIAM Symposium on
Discrete Algorithms, 2004
Time complexity - Update
• The sketch is initialized by picking the hash functions to use,
and initializing the array of counters to all zeros
• For each update operation, the item is mapped to an entry in
each row based on the hash functions ℎ𝑗 , multiplied by the
corresponding value of 𝑔𝑗
• Processing each update therefore takes time 𝑂(𝑑)
– since each hash function evaluation takes constant time.
Time complexity - Query
• Found by taking the sum of the squares of each row of the
sketch in turn, and taking the median of these sums
– that is, for each row k, compute Σ_i CM[k, i]²
– take the median of the d such estimates

• Hence the query time is linear in the size of the sketch, O(w·d)
Applications - Inner product
• The AMS sketch can be used to estimate the inner product
between a pair of vectors
• Given two frequency distributions f and f′:

f · f′ = Σ_{i=1}^{M} f(i) · f′(i)

• The AMS-sketch-based estimator is an unbiased estimator for the
inner product of the vectors
Inner Product
• Two sketches CM and CM′
• Formed with the same parameters and using the same hash
functions (same w, d, h_k, g_k)
• The row estimate is the inner product of the rows:

Σ_{i=1}^{w} CM[k, i] · CM′[k, i]
Inner Product
• Expanding Σ_{i=1}^{w} CM[k, i] · CM′[k, i] shows that the estimate gives f · f′
with additional cross-terms due to collisions of items under h_k
• The expectation of these cross-terms is zero
– over the choice of the hash functions, as the function g_k is equally
likely to add as to subtract any given term
Inner Product – Join size estimation
• The inner product has a natural interpretation as the size of the
equi-join between two relations
• In SQL:

SELECT COUNT(*) FROM D, D' WHERE D.id = D'.id
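Join-size estimation with two sketches sharing hash functions can be sketched as follows (an illustration with our own names; the key requirement is that both sketches use identical h_k and g_k):

```python
import random
from statistics import median

P = (1 << 61) - 1    # a prime larger than the item domain

def make_params(d, seed=0):
    """Shared hash parameters: (a, b) per row for h, 4 coefficients per row for g."""
    rng = random.Random(seed)
    return ([(rng.randrange(1, P), rng.randrange(P)) for _ in range(d)],
            [[rng.randrange(P) for _ in range(4)] for _ in range(d)])

def sketch(stream, w, d, params):
    """Build an AMS sketch of the stream with the given shared parameters."""
    hs, gs = params
    C = [[0] * w for _ in range(d)]
    for x in stream:
        for i in range(d):
            a, b = hs[i]
            q, r, s, t = gs[i]
            bucket = ((a * x + b) % P) % w
            sign = 2 * ((((q * x + r) * x + s) * x + t) % P % 2) - 1
            C[i][bucket] += sign
    return C

def inner_product(C1, C2):
    """Median over rows of the row-wise inner products."""
    return median(sum(u * v for u, v in zip(r1, r2)) for r1, r2 in zip(C1, C2))
```

For D = {1, 1, 1, 2, 2} and D′ = {1, 1, 1, 1, 3}, the true join size is f(1)·f′(1) = 3·4 = 12, which the estimator recovers up to collision noise.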
Example
UPDATE(23, 1), with d = 3 and w = 8. Item 23 hashes to h₁ = 3, h₂ = 1, h₃ = 7
with signs g₁ = −1, g₂ = −1, g₃ = +1, so starting from an all-zero sketch:

row 1: bucket 3 ← −1
row 2: bucket 1 ← −1
row 3: bucket 7 ← +1

UPDATE(99, 2). Item 99 hashes to h₁ = 5, h₂ = 1, h₃ = 3 with signs
g₁ = +1, g₂ = −1, g₃ = +1, adding ±2 to the corresponding buckets:

row 1: bucket 5 ← +2 (bucket 3 still −1)
row 2: bucket 1 ← −1 − 2 = −3
row 3: bucket 3 ← +2 (bucket 7 still +1)
