
Randomized Algorithms - Exam

Sebastian Yde Madsen


August 20, 2023

1 Hashing
• General introductory description of Hashing
– Want to store a set S of n (integer) keys:

S ≡ {x1 , x2 , . . . , xn }, xi ∈ Universe of integers ≡ [U ] = {0, 1, . . . , U − 1} (1)

in a memory-economical data structure that supports membership queries¹ (MQ) in an efficient manner,
i.e. generally supports MQ in (expected) sub-linear time.
– Generally accomplished through the utilization of a hash function h ∈ H (where we assume that h can be
evaluated in O(1) time). One could consider the Hashing w. Chaining (HWC) data structure² w. an array A
of size m, then:
h : [U ] → [m] (2)
– Since we must have m ≪ U , there must always be collisions, i.e.

∃ x ≠ y ∈ [U] s.t. h(x) = h(y)    (3)

resulting in linked lists of sizes > 1.

– To remedy this, i.e. minimize expected length of linked list in HWC (corresponds to expected query time), we
decide:
H = Family of truly random Hash functions (4)
Any h ∈ H is defined by:
1. ∀ x ∈ [U], h(x) is independent of h([U] \ {x}) (the hash of all other keys).
2. h is uniformly random in its range, i.e. ∀ x ∈ [U], ∀ i ∈ [m], P[h(x) = i] = 1/m
and prior to storing S, we randomly pick h ∈ H.
– Now, it is easy to realize that we have linear worst-case MQ time (in case the entire S is mapped to the same
linked list):

    Worst Case: O(n) with P[h(x1) = h(x2) = . . . = h(xn)] = 1/m^(n−1)    (Due to the independence of h.)    (5)
and we can also easily determine the expected MQ time by realizing that it corresponds to the expected length of a
linked list in the HWC data structure.

Specifically, consider the expected length of the linked list that some query element q hashes to - define the binary
indicator variable:

    Xi ≡ 1 if h(xi) = h(q), 0 otherwise    (6)

and then:

    X ≡ Σ_{i=1}^{n} Xi    (7)

¹ I.e. given some x, answer the question: is x ∈ S?
² Outer array of size m, w. linked lists (tuples of values and pointers to the next (value, pointer)).

and then utilize Linearity of Expectation + the definition of the expectation of a discrete random variable:

    E[|A[h(q)]|] = E[MQ time] = E[X] = Σ_{i=1}^{n} E[Xi] = Σ_{i=1}^{n} (P[Xi = 1] · 1 + P[Xi = 0] · 0) = Σ_{i=1}^{n} P[Xi = 1]    (8)

and then, depending on whether q ∈ S:

    q ∈ S:  Σ_{i=1}^{n} P[Xi = 1] = 1 + Σ_{xi ≠ q} P[Xi = 1] = 1 + (n − 1)/m ≤ 1 + n/m    (9)
    q ∉ S:  Σ_{i=1}^{n} P[Xi = 1] = n/m    (10)

but in general, if we set m = n, we get the bound:

    E[|A[h(q)]|] = E[MQ time] = E[X] ≤ 1 + n/n = 2    (11)
→ a constant expected MQ time:
Expected: O(1) (12)
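As a concrete illustration of the data structure analysed above, here is a minimal Python sketch of hashing with chaining. The class and method names are my own, and Python's built-in hash seeded with a random value merely stands in for drawing a (truly) random h ∈ H; it is not actually a truly random hash function.

import random

class ChainedHashTable:
    """Hashing with chaining: an outer array of m buckets, each a Python list (the chain)."""

    def __init__(self, keys, m=None):
        self.m = m if m is not None else max(1, len(keys))   # m = n gives expected chain length <= 2
        self.seed = random.getrandbits(64)                    # "picking h at random from H"
        self.buckets = [[] for _ in range(self.m)]
        for x in keys:
            self.buckets[self._h(x)].append(x)

    def _h(self, x):
        # Stand-in for a hash function h : [U] -> [m]; NOT truly random.
        return hash((self.seed, x)) % self.m

    def member(self, q):
        """Membership query: scan the chain A[h(q)] - expected O(1) time for m = n."""
        return q in self.buckets[self._h(q)]

if __name__ == "__main__":
    S = random.sample(range(10**9), 1000)
    T = ChainedHashTable(S)
    assert all(T.member(x) for x in S)
    print(T.member(-1))   # False, after scanning a chain of expected constant length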
– Furthermore, one might be concerned w. determining an upper bound on the probability of the length of the
linked list |A[h(q)]| being longer than some value.

Since this is a non-negative random variable we can utilize Markov's Inequality:

    ∀ t > 0:  P[X > t] < E[X]/t ≤ 2/t    (13)

and if we wanted to know a bound on the probability of this random variable being significantly bigger than its
expectation, we might utilize a formulation of the Chernoff bound to find:

    P[X > 4 ln(n)/ln(ln(n))] < 1/n²    (polynomially decreasing in n)    (14)

and we might even utilize the Union bound to find that none of the linked lists are longer than 4 ln(n)/ln(ln(n))
w. probability ≥ 1 − 1/n (for m = n):

    P[∀ i : |A[i]| ≤ 4 ln(n)/ln(ln(n))] ≥ 1 − 1/n    (Monotonically increasing in n)    (15)
– Note that in practice one doesn't utilize Truly Random Hash functions, as that would require storing an
array of the size of the universe filled w. uniformly random values (the value h(x) for every x ∈ [U]). This means that
one would need (typically much) more space for storing h than for storing the keys S - not practical.
Instead one utilizes c−approximate universal families, defined by the collision probability:

    ∀ x ≠ y ∈ [U]:  P_{h∈H}[h(x) = h(y)] ≤ c/m    (16)

or k−wise independent families (sometimes also called k−strongly universal), defined by:

    ∀ distinct x1, x2, . . . , xk ∈ [U] and ∀ y1, y2, . . . , yk ∈ [m]:
        P_{h∈H}[h(x1) = y1 ∧ h(x2) = y2 ∧ . . . ∧ h(xk) = yk] = 1/m^k    (17)

or sometimes the combination, (c, k)-independent families, defined by:

    ∀ distinct x1, x2, . . . , xk ∈ [U] and ∀ y1, y2, . . . , yk ∈ [m]:
        P_{h∈H}[h(x1) = y1 ∧ h(x2) = y2 ∧ . . . ∧ h(xk) = yk] ≤ c/m^k    (18)
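A classic concrete family of this kind (not spelled out in the notes) is the multiply-mod-prime scheme h(x) = ((a·x + b) mod p) mod m; the inner map x ↦ (a·x + b) mod p is 2-wise independent on [p], and reducing mod m afterwards gives a c-approximate universal family for a small constant c. A hedged sketch, with names of my own choosing:

import random

P = (1 << 61) - 1   # a large prime > U (the Mersenne prime 2^61 - 1 is a common choice)

class CarterWegmanHash:
    """h(x) = ((a*x + b) mod p) mod m, with a, b drawn at random when the function is picked."""

    def __init__(self, m):
        self.m = m
        self.a = random.randrange(1, P)   # a != 0
        self.b = random.randrange(0, P)

    def __call__(self, x):
        return ((self.a * x + self.b) % P) % self.m

if __name__ == "__main__":
    # Empirically estimate the collision probability of two fixed keys over random draws of h.
    def collides(x, y, m=1024):
        h = CarterWegmanHash(m)
        return h(x) == h(y)

    trials = 20000
    print(sum(collides(17, 42) for _ in range(trials)) / trials, "vs. 1/m =", 1 / 1024)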

• Specific application - Nearest Neighbour Search

– In the following it is assumed that we are dealing w. binary vectors:

U ≡ {0, 1}^d corresponding to [U] = [2^d]    (19)

and Hamming distance measure:

distHAM (x, y) = |{i : x[i] ̸= y[i]}| (20)

– One of the practical applications of hashing is its utilization in Nearest Neighbour Search (NNS). NNS is
defined as follows: Given n data points:

    S ≡ {x1, x2, . . . , xn}    (21)

stored in a data structure, and a relevant distance measure defined on pairs from the 'universe':

    dist : U × U → R≥0    (22)

and some query element q, return the element x* ∈ S closest to q - specifically:

    x* = argmin_{x∈S} dist(x, q)    (23)

– The problem w. exact NNS is that it is (mostly) believed that there exists no alg. w. sub-linear expected query time
when using O(poly(n)) space and having dim(U) ≥ Ω(lg(n)).
– Therefore we instead consider an approximate formulation called c−Approx R−NNS: if

    min_{x∈S} dist(x, q) = R,    (24)

then return any

    xi with dist(q, xi) ≤ cR    (c = approximation factor).    (25)

– Specifically, we consider the relaxed c−Approx R−NNS where we fix R initially, and say: if

    ∃ xi with dist(xi, q) ≤ R    (26)

return any

    xj with dist(xj, q) ≤ cR.    (27)
– It turns out that we can solve c−Approx R−NNS w. acceptable overhead by constructing multiple relaxed
c−Approx R−NNS data structures w. increasing R:

    R0, (1 + α)R0, . . . , (1 + α)^i R0, . . . , (1 + α)^N R0    (28)

– Now, say the smallest of these that returns an xj has R = (1 + α)^i R0 → we know that there are no points with
dist(x, q) < (1 + α)^{i−1} R0 and therefore that:

    R = min_{x∈S} dist(q, x) ≥ (1 + α)^{i−1} R0    (29)

and as such we have solved the original c−Approx R−NNS with approximation factor c(1 + α).
– It turns out to be reasonable to set:

    R0 ≡ min_{x≠y∈S} dist(x, y),    (1 + α)^N R0 ≡ max_{x≠y∈S} dist(x, y)    (30)

s.t.:

    # relaxed c−Approx R−NNS structures = log_{1+α}( max_{x≠y∈S} dist(x, y) / min_{x≠y∈S} dist(x, y) ) ≤ log_{1+α}(d/1) = log_{1+α}(d)    (31)

and to get rid of the awkward logarithmic base, we utilize the first-order Maclaurin expansion e^x ≈ 1 + x and write:

    log_{1+α}(d) = ln(d)/ln(1 + α) ≈ ln(d)/ln(e^α) = ln(d)/α    (32)

– Now, say that each of them takes space s0 and time t0, and that we utilize binary search to find the smallest one
returning some xj; then the time and space consumption becomes:

    t = t0 · log(ln(d)/α) = t0 · (log(ln(d)) + log(1/α))    (33)
    s = s0 · ln(d)/α    (34)

a sub-linear overhead → acceptable.
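A minimal sketch of how such a binary search over the geometrically increasing radii might look. Everything here is an assumption of mine: the structures list, its query method returning a point or None, and in particular the simplifying assumption that "returns something" is monotone in the radius (the randomized relaxed structures only guarantee this with good probability, so this is an idealized sketch, not the notes' exact procedure).

def smallest_answering_structure(structures, q):
    """structures[i] is a relaxed c-approx R-NNS structure for radius (1 + alpha)^i * R0,
    sorted by increasing radius. Assumes (idealized) monotonicity: if structures[i] answers,
    then every structures[j] with j >= i answers too, so binary search applies."""
    lo, hi = 0, len(structures) - 1
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        ans = structures[mid].query(q)   # hypothetical: a point x with dist(x, q) <= c * R_mid, or None
        if ans is not None:
            best = ans                   # feasible: try a smaller radius
            hi = mid - 1
        else:
            lo = mid + 1                 # infeasible: need a larger radius
    return best                          # answer from the smallest radius that succeeded (or None)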

– Now, let's see how we might utilize Locality Sensitive Hashing (LSH) in this context. This is a type of hash family
where the collision probability decreases with the distance → close points collide often and far points rarely.
– We say that the family H is (R, cR, P1 , P2 )−sensitive iff:
1. ∀ x, y with dist(x, y) ≤ R ⇒ PH [h(x) = h(y)] ≥ P1
2. ∀ x, y with dist(x, y) ≥ cR ⇒ PH [h(x) = h(y)] ≤ P2
(for P1 > P2 ).
– In the case of eq. (19), we can check that the family of hash functions that returns a (uniformly random) bit of
the binary vector:

    H ≡ { h(x) = x[i] : i = 1, 2, . . . , d }    (35)

s.t.:

    ∀ x ∈ [U], P_H[h(x) = x[1]] = P_H[h(x) = x[2]] = . . . = P_H[h(x) = x[d]] = 1/d    (36)

is (R, cR, P1, P2)−sensitive. This is seen by picking some x, y ∈ U w. dist(x, y) ≤ R and noting that:

    P_H[h(x) = h(y)] = 1 − dist(x, y)/d ≥ 1 − R/d    (37)

s.t. P1 = 1 − R/d. Equivalently one can determine P2 = 1 − cR/d, s.t. H is (R, cR, 1 − R/d, 1 − cR/d)−sensitive.
– Now, it is naive and tempting to use HWC to store S, but it turns out that this results in both P2 (prob. of 'far away'
points hashing to the same bucket) being too big, and P1 (prob. of 'close' points hashing to the same bucket) being too small
for usual vals. of R, d - in fact, for typical values of R, d we only have a small constant gap P1 − P2 (we
need to create a big - possibly not constant - one).
– What we do to remedy this is to draw k hash functions from H (independently) and define the hash of a point
x ∈ U as the concatenation (a bit-string in {0, 1}^k):

    g(x) = h1(x) ◦ h2(x) ◦ . . . ◦ hk(x)    (38)

resulting in a 'far away' collision prob. bound:

    ∀ x, y ∈ U, dist(x, y) ≥ cR ⇒ P_{g∼H}[g(x) = g(y)] ≤ P2^k    (Just a product, as the hi's ∼ H independently.)    (39)

and a common design choice is to take:

    P2^k = 1/n ⇐⇒ k = log_{P2}(1/n)    (40)
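A small numeric sketch of how this concatenation turns the additive gap P1 − P2 into a multiplicative one; the parameter values d, R, c, n below are made up purely for illustration.

import math

d, R, c, n = 128, 8, 2, 10**6           # illustrative values only
P1, P2 = 1 - R / d, 1 - c * R / d       # close / far collision probabilities for one sampled bit

k = math.ceil(math.log(1 / n) / math.log(P2))   # choose k so that P2^k <= 1/n
print("k =", k)
print("P1 - P2      =", P1 - P2)                 # small constant additive gap
print("P1^k vs P2^k =", P1**k, P2**k)            # P2^k ~ 1/n, while P1^k decays much more slowly (roughly n^(-1/c))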
– From here one can show that the expected nr. of 'far away' colliding points is at most 1. Specifically, by utilizing
Linearity of Expectation + the definition of the expectation of a discrete random variable it is seen that:

    E[|{x ∈ S : g(x) = g(q) ∧ dist(x, q) ≥ cR}|] ≤ n · P2^k = n · (1/n) = 1    (41)

– However, we are still faced w. a problem from the definition of the alg. - if there is any xi ∈ S with dist(xi, q) ≤ R the
algorithm should return some xj with dist(xj, q) ≤ cR, but what if no x ∈ S w. dist(x, q) ≤ cR hashes to the
same bucket as the query → then the alg. doesn't return anything valid.
– In particular, if we consider the 'close point' collision probability now:

    ∀ x, y ∈ U, dist(x, y) ≤ R ⇒ P_{g∼H}[g(x) = g(y)] ≥ P1^k    (Just a product, as the hi's ∼ H independently.)    (42)

we can see that unfortunately P1^k is too small for common vals. of R, d (even though the concatenation of k hash
functions has turned the initially constant gap (P1 − P2) into one that is polynomially big in n → (P1^k vs. P2^k) is now a
function of n).

– The final step, to remedy this and guarantee a success prob. of at least 1/2, is to draw L independent g's from H (a
total of L × k independent h's):

    g1(x) = h11(x) ◦ h12(x) ◦ . . . ◦ h1k(x)
    g2(x) = h21(x) ◦ h22(x) ◦ . . . ◦ h2k(x)
    ...
    gL(x) = hL1(x) ◦ hL2(x) ◦ . . . ◦ hLk(x)    (43)

and thus create a hash table w. L copies.
– The final algorithm is given below; as it solves the problem w. probability ≥ 1/2, one could always repeat it
multiple times to improve the probability of success.

Algorithm 1 : Locality Sensitive Hashing Query for Relaxed c-Approx R-NNS.

procedure query(q)
    counter ← 0
    for i = 1 to L do
        for xj ∈ {x ∈ S : gi(x) = gi(q)} do    ▷ All points that hash to the same bucket as the query (for the ith hash function).
            counter ← counter + 1
            if dist(xj, q) ≤ cR then
                return xj
            end if
            if counter > 3L then
                Abort algorithm
            end if
        end for
    end for
end procedure
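A hedged Python sketch of Algorithm 1 together with the preprocessing (L tables, each keyed by the concatenated bit-sampling hash g_i). The dictionary-of-buckets representation and all names are my own choices, not from the notes.

import random
from collections import defaultdict

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

class LSHIndex:
    """Relaxed c-approx R-NNS for binary vectors (as lists/tuples of 0/1) via bit-sampling LSH."""

    def __init__(self, points, d, k, L, R, c):
        self.points, self.R, self.c, self.L = points, R, c, L
        # L concatenated hash functions g_i, each described by a tuple of k sampled coordinates.
        self.gs = [tuple(random.randrange(d) for _ in range(k)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for idx, x in enumerate(points):
            for g, table in zip(self.gs, self.tables):
                table[tuple(x[j] for j in g)].append(idx)

    def query(self, q):
        """Algorithm 1: scan colliding points table by table, abort after 3L inspections."""
        inspected = 0
        for g, table in zip(self.gs, self.tables):
            for idx in table.get(tuple(q[j] for j in g), []):
                inspected += 1
                if hamming(self.points[idx], q) <= self.c * self.R:
                    return self.points[idx]
                if inspected > 3 * self.L:
                    return None      # abort
        return None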


– If we store pointers in the L copies (of the n data points), the space usage becomes:

    Space : O(nL + nd)    ('n' d−dim points take nd space.)    (45)

and as we have to compute a distance (which takes O(d) per point) at most 3L times, and if we assume it takes
t0 time to evaluate each hij (such that it takes k·t0 to evaluate each gi), the time usage becomes:

    Time : O(dL + k·t0·L)    (46)

(This is a worst-case bound for a single query: the abort rule caps the number of inspected points at 3L, and we
evaluate L concatenated hash functions of k bits each; only the correctness guarantee is probabilistic.)

– At this point another design choice is made - specifically choose:

    L = log_{1−P1^k}(1/8)    (47)

such that the probability of a 'close' point xi (dist(xi, q) ≤ R) not hashing to the same bucket (linked list) as the
query point q in any of the L tables becomes at most 1/8:

    P_{g∼H}[∀ j, gj(xi) ≠ gj(q)] = (1 − P_{g∼H}[g(xi) = g(q)])^L ≤ (1 − P1^k)^L = 1/8    (48)
– Finally - to see that this results in a success probability > 1/2 we will utilize the Union Bound - therefore we
initially define the 2 bad events resulting in an unsuccessful execution:
1. The case w. at least one xi ∈ S w. dist(xi, q) ≤ R (for the sake of analysis just consider the case w. only one) but it never
hashes to the same bucket as q:

    E1 ≡ {∀ j, gj(xi) ≠ gj(q)}    (49)

2. The case w. more than 3L 'far away' (≥ cR) points hashing to the same bucket as the query (then it is possible that the alg.
aborts even though there might be a point within distance R):

    E2 ≡ { Σ_{j=1}^{L} |{xi ∈ S : dist(xi, q) ≥ cR ∧ gj(xi) = gj(q)}| > 3L }    (50)
to this end, we bound the probability of E2 via Markov's inequality (using that each gj contributes at most 1
expected 'far away' collision, cf. eq. (41)):

    P[E2] = P[X > 3L] < E[ Σ_{j=1}^{L} Σ_{x∈S : dist(x,q)≥cR} 1_{gj(x)=gj(q)} ] / (3L) ≤ (Σ_{j=1}^{L} n · (1/n)) / (3L) = L/(3L) = 1/3    (51)

such that, by utilizing the union bound, we see:

    P[not (E1 ∪ E2)] = 1 − P[E1 ∪ E2] ≥ 1 − Σ_i P[Ei] = 1 − 1/8 − 1/3 = 13/24 > 1/2    (52)

Furthermore, we may determine a bound on L. By getting rid of the awkward base in eq. (47) and plugging in the values
of P1, P2, one arrives at:

    L ≤ 3·n^(1/c)    (53)

and then one can see the effect of the choice of approximation factor on the space and time consumption as defined
in eqs. (45) and (46).

2 Streaming and Dimensionality Reduction


• General introductory description of Streaming
– The fundamental idea is that we only receive or read a small portion of some bigger data set at a time, and that we do it in
such a way that we can only see each element xi once - a stream of elements. If we end up having streamed n
elements:

    x1, x2, . . . , xn,    xi ∈ U    (54)

a fundamental goal/requirement of the alg. is to use much less memory than n - specifically:

    Memory : O(poly(log(n))) = O(log(n)^c)    (55)

– Even though we cannot access all the data at once, we might still be able to calculate interesting properties of the data
with high accuracy (with high probability). A typical goal is to determine frequencies of items - i.e. how often
various things occur in the stream.
– A naive approach would be to store a counter for every xi ∈ U, but this would typically be very memory expensive → it
is not immediately clear that one can determine the most frequent item (Heavy Hitter) in the stream w. memory
≪ min(n, |U|).
• Let's consider a simple problem that only requires approximation (and not also randomization) for its solution - we solve
it w. a deterministic algorithm.
– The Frequency estimation / Point queries problem - defined by query(i): How many times has the ith
element occurred (so far)?
– (Think of elems. as keys from now on.) Quickly define the frequency vector:

    f ∈ R^|U|    (An entry for each elem. in the 'universe'.)    (56)

with the ith entry defined as the # occurrences of key i in the stream so far.
As such, one might consider the problem as that of creating a smaller (≪ |U| entries) representation f̃ of f → then
query(i) should return f̃i.
– A simple algorithm achieving this is Misra-Gries - it takes a space budget k (the size of f̃) and supports Update(i) and
Estimate(i):

Algorithm 2 : Misra-Gries Update(i)

1: procedure Update(i)
2:     if exists counter ci for key i then
3:         ci ← ci + 1
4:     else if have < k counters stored then
5:         add (i, 1) as counter
6:     else
7:         Decrement all counters by 1
8:         remove all counters = 0
9:     end if
10: end procedure

Algorithm 3 : Misra-Gries Estimate(i)

1: procedure Estimate(i)
2:     if exists counter ci for key i then
3:         Return ci
4:     else
5:         Return 0
6:     end if
7: end procedure
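A minimal runnable Python version of Algorithms 2 and 3 (a dict of counters, decrementing all of them when the budget k is exceeded); the class name and the small example stream are my own.

class MisraGries:
    """Frequency estimation with at most k counters: f_i - n/k <= estimate(i) <= f_i."""

    def __init__(self, k):
        self.k = k
        self.counters = {}          # key -> counter

    def update(self, i):
        if i in self.counters:
            self.counters[i] += 1
        elif len(self.counters) < self.k:
            self.counters[i] = 1
        else:
            # Decrement every counter by 1 and drop the ones that reach 0.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def estimate(self, i):
        return self.counters.get(i, 0)

if __name__ == "__main__":
    stream = [1, 2, 1, 3, 1, 2, 1, 4, 1, 5]      # n = 10, true count of key 1 is 6
    mg = MisraGries(k=2)
    for x in stream:
        mg.update(x)
    print(mg.estimate(1))   # never over-counts; under-counts by at most n/k = 5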

– It obviously doesn't always return the exact count (it sometimes decrements and removes counters). Let's analyze how
close it comes. It's fairly obvious that:

    f̃i ≤ fi    (Never over-counts)    (57)

and to give a lower bound one must realize:

1. The difference between the exact count and the estimate (|f̃i − fi|) only grows by 1 during the final else branch of Update(i).
2. The sum of all counts in f̃ has to be ≥ 0.
3. Each visit to the final else branch of Update(i) decrements the sum of all counts in f̃ by k (because it decrements all k counters by 1).

As such, the final else branch of Update(i) is 'visited' at most n/k times (otherwise the sum of counts could become negative),
and the complete interval bound is therefore:

    fi − n/k ≤ f̃i ≤ fi    (58)

and therefore we also know that every item/key i which occurs more than n/k times in the stream is guaranteed to have
a counter ci > 0 in f̃.
• Now, let's generalize the problem in a way that requires randomization for its solution (if one wants to store anything less than
the entire stream).

– Specifically, we now consider the possibility of performing integer updates of arbitrary size to the counters (instead of just
±1); for key i we can perform:

    f̃i ← f̃i + ∆,    ∆ ∈ {−M, −M + 1, . . . , M − 1, M}    (59)

here one considers 2 different frameworks:


1. Strict Turnstile: we can still add and subtract from a frequency, but we guarantee that it remains non-negative,
i.e. fi ≥ 0 at all times.
2. General Turnstile: we can add and subtract from a frequency, and we don't guarantee that it remains non-negative.
– Let's first consider the Strict turnstile setting, and define the kind of guarantee we want:

    fi ≤ f̃i ≤ fi + ε · ||f||1,    with the 1−norm defined as ||f||1 ≡ Σ_i |fi|    (60)

where the lower bound always holds and the upper bound holds with prob. ≥ 1 − δ. (It never underestimates because,
in the strict turnstile setting, all frequencies are non-negative, so colliding keys can only add to the counter that key i
hashes to.) We then outline the data structure and alg. used to obtain this. Given a k-sized array A (initialized to zeros)
and some 1-approx. universal hash function:

    h : U → [k], with ∀ x ≠ y ∈ U : P_{h∼H}[h(x) = h(y)] ≤ 1/k    (61)

the data structure supports Update(i, ∆) and Estimate(i) like:

Algorithm 4 : Strict turnstile Update(i, ∆)

1: procedure Update(i, ∆)
2:     A[h(i)] ← A[h(i)] + ∆
3: end procedure

Algorithm 5 : Strict turnstile Estimate(i)

1: procedure Estimate(i)
2:     Return A[h(i)]
3: end procedure

– Now, let's determine the space budget (value of k) required to have an additive error of at most ε||f||1 w.
probability 1 − δ - corresponding to failing w. prob. δ.

– Consider the value that the array in principle holds for any key i:

    A[h(i)] = True frequency + Noise from colliding keys = fi + Σ_{j : j≠i} 1_{h(j)=h(i)} · fj    (the noise term is defined as X)    (62)

and specifically (as the noise X is a non-negative R.V.) utilize Markov's Inequality to determine the probability of
having too large an additive error:

    P[X > ε||f||1] < E[X] / (ε||f||1)    (63)

and to that end bound the expectation utilizing Linearity of Expectation + the definition of the expectation of a
discrete random variable:

    E[X] = E[ Σ_{j : j≠i} fj · 1_{h(j)=h(i)} ] = Σ_{j : j≠i} fj · E[1_{h(j)=h(i)}] = Σ_{j : j≠i} fj · P[h(i) = h(j)] ≤ Σ_{j : j≠i} fj · (1/k) ≤ ||f||1 / k    (64)

such that:

    P[X > ε||f||1] < 1/(εk)    (65)

and then, for some given success prob. 1 − δ (and resulting failure prob. δ), and additive error factor ε, one can
always choose k:

    1/(εk) < δ ⇔ k > 1/(εδ)    (66)
– Now, it actually turns out that one can get an even smaller memory dependence on δ. In fact a total of

    O( (1/ε) · log(1/δ) )    (67)

counters suffices, and what we do to achieve this dependence is the usual trick of repeating the data structure. Specifically,
just perform t independent repetitions: arrays A1, A2, . . . , At, each of size k and each with its own hash function
h1, h2, . . . , ht, and then define the supported methods to be:

Algorithm 6 : Count-min sketch Update(i, ∆)

1: procedure Update(i, ∆)
2:     for j = 1, . . . , t do
3:         Aj[hj(i)] ← Aj[hj(i)] + ∆
4:     end for
5: end procedure

Algorithm 7 : Count-min sketch Estimate(i)

1: procedure Estimate(i)
2:     Return minj Aj[hj(i)]
3: end procedure

where we always return the value from the Aj w. the lowest estimate; since each row per definition only over-estimates the
exact frequency (and never under-estimates it), the minimum is always the result w. the lowest additive error →
hence the name Count-min Sketch.
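A hedged Python sketch of the Count-min sketch (Algorithms 6 and 7); the simple (a·x + b) mod p mod k hash below is only a stand-in for a 1-approximate universal family, and the class name and parameter choices are my own.

import math, random

P = (1 << 61) - 1   # prime modulus for the stand-in universal hash

class CountMinSketch:
    """Strict turnstile: estimate(i) in [f_i, f_i + eps*||f||_1] with prob. >= 1 - delta."""

    def __init__(self, eps, delta):
        self.k = math.ceil(2 / eps)                 # width 2/eps gives per-row failure prob. <= 1/2
        self.t = math.ceil(math.log2(1 / delta))    # depth log2(1/delta) rows
        self.A = [[0] * self.k for _ in range(self.t)]
        self.hashes = [(random.randrange(1, P), random.randrange(P)) for _ in range(self.t)]

    def _h(self, j, i):
        a, b = self.hashes[j]
        return ((a * i + b) % P) % self.k

    def update(self, i, delta=1):
        for j in range(self.t):
            self.A[j][self._h(j, i)] += delta

    def estimate(self, i):
        # Every row over-estimates f_i, so the minimum has the smallest additive error.
        return min(self.A[j][self._h(j, i)] for j in range(self.t))

if __name__ == "__main__":
    cms = CountMinSketch(eps=0.01, delta=0.01)
    for x in [7] * 1000 + list(range(5000)):
        cms.update(x)
    print(cms.estimate(7))   # >= 1001 (the true count), and rarely much larger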
– Now, the probability of this failing corresponds to the probability of all the individual rows failing simultaneously,
which is just the product as the hash functions are drawn independently. Now, if we impose the reasonable req.
of wanting the individual rows to fail w. prob. at most 1/2, s.t. δj = 1/2, and:

    P[Xj > ε||f||1] < 1/(εk) ≤ 1/2    (per row, achieved by choosing k ≥ 2/ε)    (68)

then the total prob. of failure becomes exponentially decreasing in t:

    P[all t rows fail] = P[Xj > ε||f||1]^t < 2^{−t}    (69)

and then requiring this to be at most δ:

    2^{−t} ≤ δ ⇔ t = log2(1/δ)    (70)

we get the logarithmic dependency on δ.
– From here it is also evident that both Update and Estimate take logarithmic time (in the chosen failure prob.):

    Time : O(t) = O(log2(1/δ))    (71)

– Let's now consider the General Turnstile setting. Instead of aiming for a per-row failure prob. of 1/2 (as before w. Count-
min), we will start out aiming for 1/4, and the guarantee that we will be considering can both over- and
underestimate:

    fi − ε||f||2 ≤ f̃i ≤ fi + ε||f||2,    with the 2−norm defined as ||f||2 ≡ √(Σ_i fi²)    (72)

– We will again just start w. one k−sized array A (initialized to zeros), but this time the array will be accompanied
by 2 hash functions, h and g, where h is 1-approx. universal (as before) and g is 2−wise independent and maps to {±1}:

    h : U → [k], with ∀ x ≠ y ∈ U : P_{h∼H}[h(x) = h(y)] ≤ 1/k    (73)

    g : U → {±1}, with ∀ x1 ≠ x2 ∈ U and ∀ y1, y2 ∈ {±1} : P_{g∈H}[g(x1) = y1 ∧ g(x2) = y2] = 1/2² = 1/4    (74)

and where we then define Update and Estimate as:

Algorithm 8 : General turnstile Update(i, ∆)

1: procedure Update(i, ∆)
2:     A[h(i)] ← A[h(i)] + g(i) · ∆
3: end procedure

Algorithm 9 : General turnstile Estimate(i)

1: procedure Estimate(i)
2:     Return g(i) · A[h(i)]
3: end procedure

– The general idea behind randomizing the sign via g is that the noise terms will hopefully (partially) cancel each other out,
while multiplying by g(i) in Estimate cancels the sign that was attached to the 'real' value in Update (because (±1)² = 1).
– Now, to see what value of k we must choose for eq. (72) to fail w. prob. 1/4, let's repeat the strategy from the strict
turnstile setting and consider the 'theoretical' output of Estimate:

    g(i)·A[h(i)] = g(i)·( signed True frequency + signed Noise from colliding keys )
                 = g(i)·( g(i)·fi + Σ_{j : j≠i} 1_{h(j)=h(i)} · fj·g(j) )
                 = g(i)²·fi + Σ_{j : j≠i} 1_{h(j)=h(i)} · g(i)g(j)·fj = fi + X    (75)

where g(i)² = 1 and the noise term is defined as X ≡ Σ_{j : j≠i} 1_{h(j)=h(i)} · g(i)g(j)·fj,

and with the intention of determining the failure probability, let's start out by calculating the expectation of the
noise part, utilizing linearity of expectation, the fact that for any 2 independent R.V.'s E[a · b] = E[a] · E[b],
and the definition of the expectation of a discrete random variable:

    E[X] = E[ Σ_{j : j≠i} 1_{h(j)=h(i)} g(i)g(j) fj ] = Σ_{j : j≠i} E[1_{h(j)=h(i)}] E[g(i)g(j)] fj = Σ_{j : j≠i} P[h(j) = h(i)] E[g(i)] E[g(j)] fj = 0

since E[g(i)] = E[g(j)] = 0. Now, because X is not in general a non-negative R.V. we cannot use Markov's Inequality,
but we can still use Chebyshev's Inequality:

    P[|X − E[X]| > t] < Var[X]/t² = (E[X²] − E[X]²)/t²    (76)

and by virtue of the fact that E[X] = 0, we can bound the probability of violating eq. (72) by setting t = ε||f||2:

    P[|X| > ε||f||2] < E[X²] / (ε²||f||2²)    (77)

s.t. we simply need to determine (bound) E[X²] in order to bound the failure prob. in terms of k, ε:

    E[X²] = E[ (Σ_{j : j≠i} 1_{h(j)=h(i)} g(i)g(j) fj)·(Σ_{l : l≠i} 1_{h(l)=h(i)} g(i)g(l) fl) ]
          = Σ_{j : j≠i} Σ_{l : l≠i} E[1_{h(j)=h(i)} 1_{h(l)=h(i)}] E[g(j)g(l)] fj fl    (using g(i)² = 1)
          = Σ_{j : j≠i} E[1_{h(j)=h(i)}] fj²    (since E[g(j)g(l)] = 0 for j ≠ l, while for j = l we have g(j)g(l) = 1 and 1_{h(j)=h(i)}² = 1_{h(j)=h(i)})
          = Σ_{j : j≠i} P[h(j) = h(i)] fj² ≤ (1/k) Σ_j fj² = ||f||2² / k    (78)

s.t.:

    P[|X| > ε||f||2] < E[X²] / (ε²||f||2²) ≤ 1/(ε²k)    (79)

and imposing a failure prob. of at most 1/4 (for this single array) we get:

    k > 4/ε²    (80)
However, as before (with the strict turnstile), this is not the optimal space usage, and once again what we do is to
create t independent copies of the data structure: arrays A1, A2, . . . , At, each of size k = 4/ε² and each with its own
pair of hash functions (h1, g1), (h2, g2), . . . , (ht, gt), and implement the methods as:

Algorithm 10 : Count-median sketch Update(i, ∆)

1: procedure Update(i, ∆)
2:     for j = 1, . . . , t do
3:         Aj[hj(i)] ← Aj[hj(i)] + gj(i) · ∆
4:     end for
5: end procedure

Algorithm 11 : Count-median sketch Estimate(i)

1: procedure Estimate(i)
2:     Return medianj gj(i) · Aj[hj(i)]
3: end procedure

and as before with the count-min sketch in the strict turnstile setting, we consider how many of the t arrays have to
fail for the Estimate to fail.
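A hedged Python sketch of the count-median construction (Algorithms 10 and 11). The simple seeded affine hashes below only stand in for the 1-approx. universal h_j and the 2-wise independent ±1-valued g_j, and the class name and parameter values are my own.

import random, statistics

P = (1 << 61) - 1

def _affine(seed_pair, x, mod):
    a, b = seed_pair
    return ((a * x + b) % P) % mod

class CountMedianSketch:
    """General turnstile: estimate(i) in [f_i - eps*||f||_2, f_i + eps*||f||_2] w. prob. >= 1 - delta."""

    def __init__(self, k, t):
        self.k, self.t = k, t
        self.A = [[0] * k for _ in range(t)]
        self.h_seeds = [(random.randrange(1, P), random.randrange(P)) for _ in range(t)]
        self.g_seeds = [(random.randrange(1, P), random.randrange(P)) for _ in range(t)]

    def _h(self, j, i):
        return _affine(self.h_seeds[j], i, self.k)

    def _g(self, j, i):
        return 1 if _affine(self.g_seeds[j], i, 2) == 1 else -1   # sign in {+1, -1}

    def update(self, i, delta):
        for j in range(self.t):
            self.A[j][self._h(j, i)] += self._g(j, i) * delta

    def estimate(self, i):
        return statistics.median(self._g(j, i) * self.A[j][self._h(j, i)] for j in range(self.t))

if __name__ == "__main__":
    cs = CountMedianSketch(k=400, t=21)            # k ~ 4/eps^2, t ~ 4*log_{4/e}(1/delta), illustrative values
    cs.update(7, +500); cs.update(7, -100); cs.update(3, +50)
    print(cs.estimate(7))                          # close to the true value 400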
– Specifically, consider the estimates from each of the t arrays for some ith key, ordered numerically:

    f̃i,1 ≤ f̃i,2 ≤ . . . ≤ median ≤ . . . ≤ f̃i,t−1 ≤ f̃i,t    (81)

Now, clearly, if the median is too low (< fi − ε||f||2), then all the values to its left (half of them) are also too low, and if the
median is too high (> fi + ε||f||2), then all the values to its right (half of them) are also too high.
– As such, the failure probability (which we want to be at most δ) is bounded by the probability of at least half the arrays failing:

    P[X ≥ t/2] ≤ δ,    X ≡ Σ_{j=1}^{t} Xj,    Xj ≡ 1 if Aj fails, 0 o.w.    (82)

as we are dealing w. a sum of independent 0-1 R.V.'s we utilize Chernoff's Inequality, and if we do, we get:

    P[X ≥ t/2] ≤ (e/4)^{t/4} ≤ δ    (83)

from which the number of arrays becomes:

    t ≥ 4 log_{4/e}(1/δ)    (84)

3 Applications
• Let's consider a specific application of Randomized algorithms - specifically Minimum Cut.
– Given an undirected and unweighted graph:

    G ≡ (V, E),    with vertices V and edges E ⊆ {{i, j} : i, j ∈ V, i ≠ j}    (unordered pairs)    (85)

the problem of determining a Minimum cut is defined as that of partitioning the nodes/vertices of the graph
into two disjoint sets, s.t. the number of edges between the two sets is as small as possible.
[Figure: two panels of an example 8-vertex graph (vertices 1-8) illustrating a minimum-cut partition.]

– Also quickly define the problem of determining the Minimum s-t cut as the problem of partitioning the
nodes/vertices of the graph into two disjoint sets, s.t. the number of edges between the two sets is as small
as possible, but where it is predetermined that node s has to be in one set and node t in the other.
[Figure: the same example graph with two designated vertices s and t, illustrating a minimum s-t cut.]

• There exist multiple ways to deterministically compute the (global) Minimum cut (e.g. via repeated minimum s-t cut / max-flow computations).
• A randomized approach is the alg. known as Karger's (contraction) algorithm:
– Let's start by defining a contraction of a graph. A Contraction of 2 nodes a, b in a graph G is simply the process
of merging them into a 'super node' ab, creating a new graph G′.
[Figure: a small example with nodes a, b, 2, 4, where a and b (both adjacent to 4) are contracted into the super node ab,]
s.t. there are now 2 parallel edges from 'ab' to 4 (both the one from a and the one from b).
– We might repeat this process until there are only two nodes/supernodes left in the resulting graph. Now, obviously,
depending on the sequence of contractions performed, the resulting graph might have a different number of
edges between the 2 supernodes.
– As such, the naive implementation of Karger's alg. becomes:

Algorithm 12 : Naive Karger's algorithm

procedure MinCut(G = (V, E))
    while |V| > 2 do    ▷ Will pick n − 2 edges at random. (n ≡ |V|)
        pick (i, j) ∈ E uniformly at random
        contract nodes (i), (j) → (ij)
        G ← G′
    end while
    Return G′
end procedure

– By the assumption that it takes O(n) time to update the node-edge information per contraction, and given the fact
that we will perform n − 2 contractions, the algorithm has running time:

    Time : O(n²)    (86)
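A minimal Python sketch of Algorithm 12, representing the multigraph as a list of edges and contracting by relabelling endpoints; the representation, the function name and the small test graph are my own choices, and a connected input graph is assumed.

import random

def karger_cut(n, edges):
    """One run of Karger's contraction algorithm on an undirected (multi)graph.

    n: number of vertices, labelled 0..n-1; edges: list of (u, v) pairs (graph assumed connected).
    Returns the number of edges crossing the resulting 2-way partition."""
    label = list(range(n))                          # label[v] = supernode currently containing v
    edges = [e for e in edges if e[0] != e[1]]      # drop any self-loops in the input
    vertices = n
    while vertices > 2:
        u, v = random.choice(edges)                 # pick a remaining edge uniformly at random
        ru, rv = label[u], label[v]
        for w in range(n):                          # contract: merge supernode rv into ru
            if label[w] == rv:
                label[w] = ru
        vertices -= 1
        # remove edges that became self-loops, but keep parallel edges
        edges = [e for e in edges if label[e[0]] != label[e[1]]]
    return len(edges)

if __name__ == "__main__":
    # A small test graph: two 4-cliques joined by two edges, so the min cut is 2.
    clique = lambda vs: [(a, b) for i, a in enumerate(vs) for b in vs[i + 1:]]
    E = clique([0, 1, 2, 3]) + clique([4, 5, 6, 7]) + [(0, 4), (3, 7)]
    print(min(karger_cut(8, E) for _ in range(200)))   # prints 2 with high probability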

– Let's analyze the algorithm by considering the probability of it determining the (global) minimum cut → we
consider the absolute success prob., and no approximation ratio.
– Let's define the minimum cut of a graph as the set C of edges involved in the cut → the size of the cut is then
|C|.
– Begin by making the observation that if none of the edges in C are among the n − 2 (where n = |V|) contracted edges,
then C has survived, and the edges connecting the two supernodes in the resulting graph correspond exactly to the min. cut C.
– Define:

    Ei ≡ an edge of C is picked for the ith contraction,
    Si ≡ all edges in C have survived after the ith contraction.

as such, P[Sn−2] corresponds to having contracted none of the edges in the (global) min. cut C at the end of
the algorithm, and is given as:

    P[Sn−2] = P[¬E1] · P[¬E2 | ¬E1] · P[¬E3 | ¬E1 ∩ ¬E2] · . . . · P[¬En−2 | ¬E1 ∩ ¬E2 ∩ . . . ∩ ¬En−3]    (87)
            = Π_{i=1}^{n−2} P[¬Ei | ∩_{j=1}^{i−1} ¬Ej]    (by the chain rule for conditional probabilities.)    (88)

– To bound the probability, consider:

∗ Recall that the number of edges incident to a node/vertex in a graph is called its degree, and
since every edge is incident to exactly two vertices, in general it holds that:

    Σ_{v∈V} deg(v) = 2|E|    (89)

∗ Furthermore, at (just before) the ith contraction, the graph has n − i + 1 nodes/vertices, and at every step
in the sequence of contractions, every remaining node/vertex has deg(v) ≥ |C| (otherwise cutting that single vertex
off would give a smaller cut, and C wouldn't be the global min. cut). Therefore:

    # Edges at the ith contraction ≥ (n − i + 1)·|C| / 2    (90)

and we might (upper) bound the probability of picking some e ∈ C at the ith contraction (without having picked
any in advance):

    P[Ei | ∩_{j=1}^{i−1} ¬Ej] ≤ (# Edges in min. cut) / (# Edges at the ith contraction) ≤ 2 / (n − i + 1)    (91)

– As such, the probability of not picking any e ∈ C at the ith contraction is bounded as:

    P[¬Ei | ∩_{j=1}^{i−1} ¬Ej] ≥ 1 − 2/(n − i + 1)    (92)

resulting in an algorithm that takes O(n²) running time, but has a polynomially decreasing success probability:

    P[Sn−2] = Π_{i=1}^{n−2} P[¬Ei | ∩_{j=1}^{i−1} ¬Ej] ≥ Π_{i=1}^{n−2} (1 − 2/(n − i + 1))
            = (1 − 2/n)·(1 − 2/(n−1)) · · · (1 − 2/3) = (n−2)/n · (n−3)/(n−1) · (n−4)/(n−2) · · · 1/3 = 2/(n(n − 1)) = (n choose 2)^{−1}    (93)

– However, as so often before, what we can do is simply repeat the algorithm k times to improve the success probability.
Specifically, say that we require the failure prob. (the probability that C survives in none of the k repetitions) to be:

    P[¬Sn−2 in all k repetitions] ≤ 1/n    (94)

we can solve for k and find that:

    (1 − 2/(n(n−1)))^k ≤ 1/n ⇔ k ≥ ln(1/n) / ln(1 − 2/(n(n−1))) = −ln(n) / ln(1 − 2/(n(n−1)))    (95)

and then, utilizing the fact that ∀ x : e^{−x} ≥ 1 − x (obvious from the Maclaurin expansion of e^x) together with the
monotonicity of ln, so that ln(1 − x) ≤ −x, it suffices to take:

    k = −ln(n) / (−2/(n(n−1))) = (n(n − 1)/2) · ln(n)  ≥  −ln(n) / ln(1 − 2/(n(n−1)))    (96)

s.t. we must repeat Karger's algorithm O(n² ln(n)) times to guarantee:

    P[Sn−2 holds in at least one of the k repetitions] ≥ 1 − 1/n    (97)

resulting in a total runtime:

    Time : O(n⁴ ln(n))    (98)

which is still better than the deterministic approach via Edmonds-Karp (at least for fairly dense graphs).
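A short companion sketch of the repetition argument above; to keep it self-contained it takes any single-run contraction routine (such as the hypothetical karger_cut sketched earlier) as a parameter, and the repetition count follows eq. (96).

import math

def min_cut_whp(n, edges, single_run):
    """Repeat a single-run contraction algorithm often enough that the true min cut
    survives in at least one run with probability >= 1 - 1/n."""
    reps = math.ceil(n * (n - 1) / 2 * math.log(n))   # k >= (n(n-1)/2) * ln(n) repetitions
    return min(single_run(n, edges) for _ in range(reps))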
• Let's now consider an improvement to the process of repeating Karger's (contraction) algorithm:
– First, observe that each factor in eq. (93) is decreasing in i, and specifically, that performing only n − l contractions
yields:

    P[Sn−l] ≥ l(l − 1) / (n(n − 1))    (99)

as the prob. of the specific min. cut C having survived thus far. As such, setting l ≈ n/√2 yields³:

    P[S_{n − n/√2}] ≥ 1/2    (100)
– The idea is then that, if one only performs the first n − l contractions in each run of Karger's algorithm, the prob.
of a specific min. cut surviving is bigger. Based on this, the Fast Cut algorithm is devised as:

Algorithm 13 : Fast Min. Cut algorithm

procedure FastMinCut(G = (V, E))
    if |V| ≤ 6 then
        Return Brute-force solution.
    end if
    G1 ← Karger's algorithm on G using n − n/√2 contractions.
    G2 ← Karger's algorithm on G using n − n/√2 contractions.    ▷ Using a different RNG seed to make them independent.
    X1 ← FastMinCut(G1)    ▷ Recursive calls
    X2 ← FastMinCut(G2)
    Return the smaller of the two cuts X1 and X2.
end procedure
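A hedged recursive Python sketch of Algorithm 13. The contraction bookkeeping mirrors the earlier karger_cut sketch but keeps the contracted edge list, the contraction target ⌈1 + n/√2⌉ and all function names are my own choices (the notes' footnote suggests a slightly different constant), and the base case is brute force over at most 6 supernodes.

import math, random
from itertools import combinations

def partial_contract(vertex_labels, edges, target):
    """Contract random edges until only `target` supernodes remain; assumes a connected graph.
    Returns (remaining_labels, remaining_edges)."""
    labels = set(vertex_labels)
    edges = [e for e in edges if e[0] != e[1]]
    while len(labels) > target and edges:
        u, v = random.choice(edges)
        labels.discard(v)                                                   # merge supernode v into u
        edges = [((u if a == v else a), (u if b == v else b)) for a, b in edges]
        edges = [e for e in edges if e[0] != e[1]]                          # drop new self-loops
    return labels, edges

def brute_force_cut(labels, edges):
    """Smallest cut over all 2-way partitions of the (<= 6) remaining supernodes."""
    labels = list(labels)
    best = len(edges)
    for r in range(1, len(labels)):
        for side in combinations(labels, r):
            side = set(side)
            best = min(best, sum((a in side) != (b in side) for a, b in edges))
    return best

def fast_min_cut(labels, edges):
    n = len(labels)
    if n <= 6:
        return brute_force_cut(labels, edges)
    target = math.ceil(1 + n / math.sqrt(2))
    results = []
    for _ in range(2):                                                      # two independent partial contractions
        sub_labels, sub_edges = partial_contract(labels, edges, target)
        results.append(fast_min_cut(sub_labels, sub_edges))                 # recursive calls
    return min(results)

if __name__ == "__main__":
    clique = lambda vs: [(a, b) for i, a in enumerate(vs) for b in vs[i + 1:]]
    E = clique([0, 1, 2, 3]) + clique([4, 5, 6, 7]) + [(0, 4), (3, 7)]
    print(fast_min_cut(set(range(8)), E))   # 2 with high probability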

– The first call takes O(n²) because the algorithm has to perform Karger's contractions on the full n-vertex graph.

³ Specifically, it should be 1/1.69722 to be accurate.
