Randomized Algorithms Notes
1 Hashing
• General introductory description of Hashing
– Want to store a set S of n (integer) keys in a memory-economical data structure that supports membership queries (MQ) in an efficient manner,
i.e. generally supports MQ in (expected) sub-linear time.
– Generally accomplished through the utilization of a hash function h ∈ H (where we assume that h can be
evaluated in O(1) time). One could consider the Hashing w. Chaining (HWC) data structure w. an array A
of size m, then:
  h : [U] → [m] (2)
– Since we must have m ≪ U, there must always be collisions, i.e. ∃ x ≠ y ∈ [U] with h(x) = h(y).
– To remedy this, i.e. minimize the expected length of a linked list in HWC (which corresponds to the expected query time), we choose:
  H = Family of truly random hash functions (4)
Any h ∈ H is defined by:
1. ∀ x ∈ [U], h(x) is independent of the hashes of all other keys, i.e. of { h(y) : y ∈ [U] \ {x} }.
2. h is uniformly random in its range, i.e. ∀ x ∈ [U], ∀ i ∈ [m]: P[h(x) = i] = 1/m
and prior to storing S, we randomly pick h ∈ H.
– Now, it is easy to realize that we have linear worst-case MQ time (in case the entire S is mapped to the same
linked list):
  Worst Case: O(n) with P[h(x1) = h(x2) = ... = h(xn)] = 1/m^(n−1) (Due to independence of h.) (5)
and we can also easily determine expected MQ time by realizing that it corresponds to expected length of a
linked list in HWC data structure.
Specifically consider the expected length of the linked list that some query element q hashes to - define the binary indicator variable:
  Xi ≡ 1 if h(xi) = h(q), and 0 otherwise (6)
and then:
  X ≡ Σ_{i=1}^{n} Xi (7)
and then utilize Linearity of Expectation + the definition of the expectation of a discrete random variable:
  E[|A[h(q)]|] = E[MQ time] = E[X] = Σ_{i=1}^{n} E[Xi] = Σ_{i=1}^{n} ( P[Xi = 1] · 1 + P[Xi = 0] · 0 ) = Σ_{i=1}^{n} P[Xi = 1] (8)
and since each P[Xi = 1] = 1/m by uniformity of h, the expected MQ time is n/m - i.e. O(1) whenever m = Ω(n).
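To make the above concrete, here is a minimal Python sketch of Hashing w. Chaining (my own illustrative code - a salted built-in hash stands in for a truly random h ∈ H):

import random

class HashWithChaining:
    """Array A of m buckets; each bucket is a Python list playing the role of a linked list."""
    def __init__(self, m, seed=0):
        self.m = m
        # Picking a random salt stands in for drawing a truly random h from H.
        self.salt = random.Random(seed).getrandbits(64)
        self.A = [[] for _ in range(m)]

    def _h(self, x):
        # Stand-in for h : [U] -> [m]; a truly random h would be uniform and independent per key.
        return hash((self.salt, x)) % self.m

    def insert(self, x):
        bucket = self.A[self._h(x)]
        if x not in bucket:
            bucket.append(x)

    def member(self, q):
        # Membership query: scan the single bucket q hashes to; expected bucket length is n/m.
        return q in self.A[self._h(q)]

# Example: store S and answer MQs.
table = HashWithChaining(m=128)
for key in [3, 1, 4, 1, 5, 9, 2, 6]:
    table.insert(key)
print(table.member(5), table.member(7))  # True False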
– In the following it is assumed that we are dealing w. binary vectors x ∈ {0, 1}^d, with dist(·,·) being the Hamming distance (cf. eq. (19)).
– One of the practical applications of hashing is its utilization in Nearest Neighbour Search (NNS). NNS is
defined as follows: Given n data points
  S ≡ {x1, x2, ..., xn} (21)
stored in a data structure, and a relevant distance measure dist(·,·) defined on the 'universe', answer a query q by returning the nearest point:
  x* = argmin_{x∈S} dist(x, q) (23)
– The problem w. exact NNS is that it is (mostly) believed that there exists no alg. w. sub-linear expected query time when
using O(poly(n)) space and having dim(U) ≥ Ω(lg(n)).
– Therefore we instead consider an approximate formulation called c−Approx R−NNS: if
  min_{x∈S} dist(x, q) ≤ R, (24)
return any
  xj with dist(xj, q) ≤ cR. (27)
– Turns out that we can solve c−Approx R−NNS w. acceptable overhead by construction of multiple relaxed
c−Approx R−NNS data structures w. increasing R.
– Now, say the smallest of these that returns an xj has R = (1 + α)^i R0 → we know that there are no points with
dist(x, q) < (1 + α)^(i−1) R0 and therefore that
  dist(xj, q) ≤ c(1 + α)^i R0 = c(1 + α) · (1 + α)^(i−1) R0 ≤ c(1 + α) · min_{x∈S} dist(x, q),
and as such we have solved the original c−Approx R−NNS w. approximation factor c(1 + α).
– It turns out to be reasonable to set R0 = min_{x≠y∈S} dist(x, y), s.t.:
  # relaxed c−Approx R−NNS structures = log_{1+α}( max_{x≠y∈S} dist(x, y) / min_{x≠y∈S} dist(x, y) ) ≤ log_{1+α}(d/1) = log_{1+α}(d) (31)
and to get rid of the weird logarithmic basis, we utilize the 1st-order Maclaurin expansion e^x ≈ 1 + x (i.e. ln(1 + α) ≈ α) and write:
  log_{1+α}(d) = ln(d)/ln(1 + α) ≈ ln(d)/α
– Now, say that each of them takes space s0 and time t0, and that we utilize binary search to find the smallest one
returning some xj, then the time and space consumption becomes:
  t = t0 · log( ln(d)/α ) = t0 · ( log ln(d) + log(1/α) ) (33)
  s = s0 · ln(d)/α (34)
sub-linear overhead → acceptable.
– Now, let's see how we might utilize Locality Sensitive Hashing in this context. This is a type of hash function
where the collision probability decreases with the distance → close points collide often and far points rarely.
– We say that the family H is (R, cR, P1 , P2 )−sensitive iff:
1. ∀ x, y with dist(x, y) ≤ R ⇒ PH [h(x) = h(y)] ≥ P1
2. ∀ x, y with dist(x, y) ≥ cR ⇒ PH [h(x) = h(y)] ≤ P2
(for P1 > P2 ).
– In the case of eq. (19), we can check that the family of hash functions that returns a (uniformly random) bit of
the binary vectors,
  H ≡ { h(x) = x[i] | i = 1, 2, ..., d } (35)
s.t.
  ∀ x ∈ [U]: P_H[h(x) = x[1]] = P_H[h(x) = x[2]] = ... = P_H[h(x) = x[d]] = 1/d, (36)
is (R, cR, P1, P2)−sensitive. This is seen by picking some x, y ∈ U w. dist(x, y) ≤ R and observing that
  P_H[h(x) = h(y)] = 1 − dist(x, y)/d ≥ 1 − R/d (37)
s.t. P1 = 1 − R/d. Equivalently one can determine P2 = 1 − cR/d, s.t. H is (R, cR, 1 − R/d, 1 − cR/d)−sensitive.
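A tiny Python sketch of this bit-sampling family (names are my own; it also checks eq. (37) empirically):

import random

def draw_bit_sampling_hash(d, rng=random):
    """Draw h from H: h(x) = x[i] for a uniformly random coordinate i in {0, ..., d-1}."""
    i = rng.randrange(d)
    return lambda x: x[i]

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

# Empirical check of eq. (37): P[h(x) = h(y)] = 1 - dist(x, y)/d.
x = (1, 0, 1, 1, 0, 0, 1, 0)
y = (1, 0, 0, 1, 0, 1, 1, 0)
d = len(x)
hits = 0
for _ in range(100_000):
    h = draw_bit_sampling_hash(d)
    hits += (h(x) == h(y))
print(hits / 100_000, 1 - hamming(x, y) / d)  # both approx. 0.75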
– Now, it is naive and tempting to use HWC to store S, but it turns out that this results in both P2 (prob. of 'far away'
points hashing to the same bucket) being too big, and P1 (prob. of 'close' points hashing to the same bucket) being too small for usual
vals. of R, d - in fact, for typical choices of R, d the gap P1 − P2 is only a small constant (we need to create a big - possibly non-constant - one).
– What we do to remedy this is to draw k hash functions from H (independently) and define the hash of some point
x ∈ U as the concatenation (a bit-string in {0, 1}^k):
  g(x) = h1(x) ◦ h2(x) ◦ ... ◦ hk(x) (38)
resulting in a 'far away' collision prob. bound:
  P_H[g(x) = g(y)] ≤ P2^k for dist(x, y) ≥ cR.
– However, we are still faced w. a problem from the definition of the alg. - if there is any xi ∈ S with dist(xi, q) ≤ R the
algorithm should return some xj with dist(xj, q) ≤ cR, but what if no x ∈ S w. dist(x, q) ≤ cR hashes to the
same bucket as the query → then the alg. doesn't return anything valid.
– In particular, if we consider the 'close point' collision probability now,
  P_H[g(x) = g(q)] ≥ P1^k for dist(x, q) ≤ R,
we can see that unfortunately P1^k is too small for common vals. of R, d (even though the concatenation of k hash
functions has turned the initially constant gap P1 − P2 into a polynomially big gap in n → P1^k − P2^k is now a
function of n).
– The final step, to remedy this and guarantee a success prob. of at least 1/2, is to draw L independent g's from H (a
total of L × k independent h's):
  g1(x) = h11(x) ◦ h12(x) ◦ ... ◦ h1k(x)
  g2(x) = h21(x) ◦ h22(x) ◦ ... ◦ h2k(x)
  ...
  gL(x) = hL1(x) ◦ hL2(x) ◦ ... ◦ hLk(x) (43)
and thus create L hash tables (one per gj).
– The final algorithm then becomes the one sketched below, and as it solves the problem w. probability 1/2, one could always repeat it to boost the success probability.
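Since the algorithm box itself is not reproduced in these notes, the following Python sketch is one plausible reading of the complete structure: L tables keyed by k-bit concatenations gj of bit-sampling hashes, and a query that stops after inspecting 3L candidates (all names and the exact stopping rule are my own illustrative choices):

import random
from collections import defaultdict

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

class LSHIndex:
    def __init__(self, points, d, k, L, seed=0):
        rng = random.Random(seed)
        # L independent g_j's, each a concatenation of k bit-sampling hashes (random coordinates).
        self.coords = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        self.points = list(points)
        for idx, x in enumerate(self.points):
            for j in range(L):
                self.tables[j][self._g(j, x)].append(idx)

    def _g(self, j, x):
        # g_j(x) = h_j1(x) ◦ ... ◦ h_jk(x), represented as a k-tuple of bits.
        return tuple(x[i] for i in self.coords[j])

    def query(self, q, cR):
        """Return some stored point within distance cR of q, or None.
        Inspects at most 3L colliding candidates before giving up (the event E2 below)."""
        L = len(self.tables)
        inspected = 0
        for j in range(L):
            for idx in self.tables[j][self._g(j, q)]:
                x = self.points[idx]
                if hamming(x, q) <= cR:
                    return x
                inspected += 1
                if inspected >= 3 * L:
                    return None
        return None

# Example usage with 8-bit vectors (illustrative parameters).
pts = [(1, 0, 1, 1, 0, 0, 1, 0), (0, 1, 1, 0, 1, 0, 0, 1)]
index = LSHIndex(pts, d=8, k=3, L=5)
print(index.query((1, 0, 1, 1, 0, 0, 1, 1), cR=2))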
To see that the success prob. is indeed at least 1/2, consider the event E2 that more than 3L 'far away' points (dist(x, q) ≥ cR) collide with q across the L tables. Writing X ≡ Σ_{j=1}^{L} Σ_{x∈S: dist(x,q)≥cR} 1_{gj(x)=gj(q)}, we bound the probability of E2 via Markov's inequality (using that k is chosen s.t. P2^k ≤ 1/n):
  P[E2] = P[X > 3L] < E[X]/(3L) = ( Σ_{j=1}^{L} Σ_{x∈S: dist(x,q)≥cR} E[1_{gj(x)=gj(q)}] )/(3L) ≤ ( Σ_{j=1}^{L} n · (1/n) )/(3L) = L/(3L) = 1/3 (51)
such that, by utilizing the union bound (with E1 denoting the other bad event - a 'close' point failing to collide with q in any of the L tables - whose probability is bounded by 1/8 for the chosen L), we see:
  P[¬E1 ∧ ¬E2] = 1 − P[E1 ∪ E2] ≥ 1 − Σ_i P[Ei] = 1 − 1/8 − 1/3 = 13/24 > 1/2 (52)
Furthermore, we may determine a bound on L. By getting rid of the annoying basis in eq. (47) and plugging in the values
of P1, P2, we arrive at:
  L ≤ 3n^(1/c) (53)
and then one can see the effect of the choice of approximation factor on the space and time consumption as defined
in eqs. (45) and (46).
2 Streaming
– Even though we cannot access all the data at once (it arrives as a stream), we might still be able to calculate interesting properties of the data
with high accuracy (with high probability). A typical goal is to determine frequencies of items - i.e. how often
various things occur in the stream.
– A naive approach would be to store a counter for every xi ∈ U, but this would typically be very memory expensive → it
is not immediately clear that one can determine the most frequent item (Heavy Hitter) in a stream w. memory
≪ min(n, |U|).
• Let's consider a simple problem that only requires approximation (and not also randomization) to solve - we solve
it w. a deterministic algorithm.
– The Frequency estimation / Point queries problem - defined by query(i): How many times has the ith
element occurred (so far)?
– (Think of elems. as keys from now on.) Quickly define the frequency vector:
  f ∈ R^|U|, (An entry for each elem. in the 'universe'.) (56)
with the i-th entry defined as the # occurrences of xi ∈ U in the stream so far.
As such, one might consider the problem as that of creating a (much smaller than |U| entries) representation f̃ of f → then
query(i) should return f̃i.
– A simple algorithm achieving this is Misra-Gries - it takes a space budget k (the size of f̃) and supports Update(i) and
Estimate(i), as sketched below:
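A minimal Python sketch of Misra-Gries as commonly stated (the notes' own pseudocode box is not reproduced here, so names are illustrative):

class MisraGries:
    def __init__(self, k):
        self.k = k
        self.counters = {}  # the compressed representation f~ (at most k keys)

    def update(self, i):
        if i in self.counters:
            self.counters[i] += 1
        elif len(self.counters) < self.k:
            self.counters[i] = 1
        else:
            # No free counter: decrement all, and drop counters hitting zero.
            for key in list(self.counters):
                self.counters[key] -= 1
                if self.counters[key] == 0:
                    del self.counters[key]

    def estimate(self, i):
        return self.counters.get(i, 0)

# Example usage.
mg = MisraGries(k=2)
for item in ["a", "b", "a", "c", "a", "a", "b"]:
    mg.update(item)
print(mg.estimate("a"), mg.estimate("c"))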
– Obviously it doesn't always return the exact count (it decrements and removes counters sometimes). Let's analyze how
close it comes. It's fairly obvious that the estimate never exceeds the true count, and that at most n/k 'decrement-all' rounds can occur (each one cancels k counted occurrences), s.t.:
  fi − n/k ≤ f̃i ≤ fi
and therefore we also know that every item/key i which occurs more than n/k times in the stream is guaranteed to have
a counter ci > 0 in f̃.
• Now, let's generalize the problem in a way that requires randomization for its solution (if one wants to store anything less than
the entire stream).
– Specifically, we now consider the possibility of performing generally-sized integer updates of the counters (instead of just
±1): for key i we can perform Update(i, Δ), corresponding to fi ← fi + Δ. The sketch is a single array A of k counters together with a hash function h : U → [k]; Update(i, Δ) adds Δ to A[h(i)] and Estimate(i) returns A[h(i)].
– Now, let's determine the space budget (value of k) required to have an additive error of at most ε||f||1 w.
probability 1 − δ - corresponding to failing w. prob. δ.
– Consider the value that the array in principle holds for any key i (true frequency plus noise X from colliding keys):
  A[h(i)] = fi + Σ_{j≠i} 1_{h(j)=h(i)} fj ≡ fi + X (62)
and specifically (as the noise X is a non-negative R.V.) utilize Markov's Inequality to determine the probability of
having too large an additive error:
  P[X > ε||f||1] < E[X]/(ε||f||1) (63)
and to that end bound the expectation utilizing Linearity of Expectation + the definition of the expectation of a
discrete random variable:
  E[X] = E[ Σ_{j≠i} 1_{h(j)=h(i)} fj ] = Σ_{j≠i} fj E[1_{h(j)=h(i)}] = Σ_{j≠i} fj P[h(i) = h(j)] ≤ Σ_{j≠i} fj/k ≤ ||f||1/k (64)
such that:
  P[X > ε||f||1] < 1/(εk) (65)
and then, for some given success prob. 1 − δ (and resulting failure prob. δ), and additive error factor ε, one can
always choose k:
  1/(εk) < δ ⇔ k > 1/(εδ) (66)
– Now, it actually turns out that one can get an even smaller memory dependence on δ. In fact it suffices to choose
k:
  k = O( (1/ε) · log(1/δ) ) (67)
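As a quick numerical illustration (my own numbers): for ε = δ = 0.01, eq. (66) demands k > 1/(εδ) = 10 000 counters, whereas the bound in eq. (67) is on the order of (1/ε) · log2(1/δ) ≈ 100 · 6.64 ≈ 664 counters in total.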
and what we do to achieve this dependence is the usual trick of repeating the data structure. Specifically, just perform
t independent repetitions (each an array of k counters with its own hash function):
  A1 (k counters), h1
  A2 (k counters), h2
  ...
  At (k counters), ht
where we always return the value from the Aj w. the lowest estimate: as we per definition only overestimate the
exact frequency (and never underestimate), this will always be the result w. the lowest additive error →
the structure is therefore also named the Count-min Sketch.
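A minimal Python sketch of the Count-min structure just described, with t rows of k counters; salted built-in hashes stand in for the 1-approx universal hj's, and the choices k = ceil(2/ε), t = ceil(log2(1/δ)) follow the analysis below (my own illustrative code):

import math
import random

class CountMinSketch:
    def __init__(self, eps, delta, seed=0):
        self.k = math.ceil(2 / eps)                       # row width: 1/(eps*k) <= 1/2 per row
        self.t = max(1, math.ceil(math.log2(1 / delta)))  # number of rows: 2^(-t) <= delta
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(self.t)]
        self.rows = [[0] * self.k for _ in range(self.t)]

    def _h(self, j, key):
        # Salted built-in hash as a stand-in for a 1-approx universal h_j : U -> [k].
        return hash((self.salts[j], key)) % self.k

    def update(self, key, delta=1):
        # Strict turnstile update f_key += delta (delta assumed non-negative here).
        for j in range(self.t):
            self.rows[j][self._h(j, key)] += delta

    def estimate(self, key):
        # Every row only over-estimates, so the minimum has the smallest additive error.
        return min(self.rows[j][self._h(j, key)] for j in range(self.t))

# Example usage.
cms = CountMinSketch(eps=0.01, delta=0.01)
for key in ["a", "b", "a", "c", "a"]:
    cms.update(key)
print(cms.estimate("a"))  # >= 3, and close to 3 w. high probability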
– Now, the probability of this failing corresponds to the probability of all the individual ones failing simultaneously,
which is just the product, as the hash functions are drawn independently. Now, if we impose the reasonable req.
of wanting each individual one to fail w. prob. at most 1/2, s.t. δj = 1/2 and:
  P[X > ε||f||1] < 1/(εk) ≤ 1/2 (68)
then the total prob. of failure becomes exponentially decreasing in t:
  P[all t estimates have error > ε||f||1] < 2^(−t) (69)
and then requiring this to be at most δ:
  2^(−t) ≤ δ ⇔ t ≥ log2(1/δ) (70)
we get the logarithmic dependency on δ.
– From here it is also evident that both Update and Estimate take logarithmic time (in the chosen failure prob.):
  Time: O(t) = O( log2(1/δ) ) (71)
– Let's now consider the General Turnstile setting, and instead of aiming for a per-array failure prob. δ = 1/2 (as before w. Count-
min), we will start out w. aiming for δ = 1/4; the guarantee that we will be considering can both over- and
underestimate:
  fi − ε||f||2 ≤ f̃i ≤ fi + ε||f||2, with the 2-norm defined as ||f||2 ≡ sqrt( Σ_i fi² ) (72)
As before, the sketch is a single array A of k counters, but this time the array is accompanied by 2 hash functions, h and g, where h is 1-approx universal (as
before) but g is 2-wise independent and maps to {±1}:
  h : U → [k], with ∀ x ≠ y ∈ U: P_{h∼H}[h(x) = h(y)] ≤ 1/k (73)
  g : U → {±1}, with ∀ x1 ≠ x2 ∈ U and ∀ y1, y2 ∈ {±1}: P_{g∼H}[g(x1) = y1 ∧ g(x2) = y2] = 1/4 (74)
– The general idea behind the randomization of the sign from g is that it will hopefully reduce the noise, as parts
of it cancel out, while in Estimate the sign of g on the 'real' value - applied in Update - is cancelled by
multiplying by g(i) again (because (±1)² = 1).
– Now, to see what value of k we must choose for eq. (72) to fail w. prob. δ = 1/4, let's repeat the strategy from the strict
turnstile and consider the 'theoretical' output of Estimate (signed true frequency + signed noise from colliding keys):
  g(i)·A[h(i)] = g(i)( g(i)fi + Σ_{j≠i} 1_{h(j)=h(i)} fj g(j) ) = g(i)²fi + Σ_{j≠i} 1_{h(j)=h(i)} g(i)g(j)fj = fi + X (75)
where g(i)² = (±1)² = 1 and X ≡ Σ_{j≠i} 1_{h(j)=h(i)} g(i)g(j)fj denotes the noise.
and with the intention of determining the failure probability, let's start out by calculating the expectation of the
noise part, by utilizing linearity of expectation, the fact that for any 2 independent R.V.'s E[a · b] = E[a] · E[b],
and the definition of the expectation of a discrete random variable:
  E[X] = E[ Σ_{j≠i} 1_{h(j)=h(i)} g(i)g(j)fj ] = Σ_{j≠i} E[1_{h(j)=h(i)}] E[g(i)g(j)] fj = Σ_{j≠i} P[h(j) = h(i)] E[g(i)] E[g(j)] fj = 0
since E[g(i)] = E[g(j)] = 0.
Now, because X is not in general a non-negative R.V. we cannot use Markov's Inequality, but we can still use Chebyshev's Inequality:
  P[|X − E[X]| > t] < Var[X]/t² = (E[X²] − E[X]²)/t² (76)
and by virtue of the fact that E[X] = 0, we can estimate the probability of violating eq. (72) by setting t = ε||f||2:
  P[|X| > ε||f||2] < E[X²]/(ε²||f||2²) (77)
s.t. we simply need to determine (bound) E[X²] in order to bound the failure prob. in terms of k, ε:
  E[X²] = E[ ( Σ_{j≠i} 1_{h(j)=h(i)} g(i)g(j)fj )( Σ_{l≠i} 1_{h(l)=h(i)} g(i)g(l)fl ) ] = E[ Σ_{j≠i} Σ_{l≠i} 1_{h(j)=h(i)} 1_{h(l)=h(i)} g(i)² g(j)g(l) fj fl ]
and since E[g(j)g(l)] = 0 for j ≠ l (2-wise independence), only the diagonal terms j = l survive:
  E[X²] = Σ_{j≠i} E[1_{h(j)=h(i)}] fj² ≤ Σ_{j≠i} fj²/k ≤ ||f||2²/k
s.t.:
  P[|X| > ε||f||2] < E[X²]/(ε²||f||2²) < 1/(ε²k) (79)
and imposing a failure prob. of at most δ = 1/4 we get:
  k > 4/ε² (80)
however, as before (with the strict turnstile), this is not the optimal space usage, and once again what we do is to
create t independent copies of the data structure (each an array of k = 4/ε² counters with its own pair of hash functions):
  A1 (k counters), (h1, g1)
  A2 (k counters), (h2, g2)
  ...
  At (k counters), (ht, gt)
and as before with the Count-min sketch in the strict turnstile, we consider how many of the t arrays have to
fail for the Estimate to fail.
– Specifically, consider the estimates from each of the t arrays for some i-th key, ordered numerically, and return the median:
now, clearly, if the median is too low (< fi − ε||f||2), then all the values to its left (half of them) are also too low, and if the median is
too high (> fi + ε||f||2), then all the values to its right (half of them) are also too high.
– As such, we can bound the failure prob. (which we want to be at most δ) by the prob. of at least half the arrays failing - the median estimate can only fail if at least t/2 of the individual estimates fail:
  P[X > t/2] ≤ δ, X ≡ Σ_{j=1}^{t} Xj, Xj ≡ 1 if Aj fails and 0 o.w. (82)
as we are dealing w. a sum of independent 0-1 R.V.'s we can utilize Chernoff's Inequality, and if we do, we get:
  P[X > t/2] ≤ (e/4)^(t/4) ≤ δ (83)
from which the number of arrays becomes:
  t ≥ 4 log_{4/e}(1/δ) (84)
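A minimal Python sketch of the resulting structure (commonly named the Count-Sketch): t rows, each with its own (hj, gj) pair approximated here by salted built-in hashes, and a median over the signed row estimates; the parameter choices mirror eqs. (80) and (84) (my own illustrative code):

import math
import random

class CountSketch:
    def __init__(self, eps, delta, seed=0):
        self.k = math.ceil(4 / eps ** 2)  # per-row failure prob. <= 1/(eps^2 k) <= 1/4, cf. eq. (80)
        self.t = max(1, math.ceil(4 * math.log(1 / delta) / math.log(4 / math.e)))  # cf. eq. (84)
        rng = random.Random(seed)
        self.salts = [(rng.getrandbits(64), rng.getrandbits(64)) for _ in range(self.t)]
        self.rows = [[0] * self.k for _ in range(self.t)]

    def _h(self, j, key):
        # Salted built-in hash as a stand-in for the 1-approx universal h_j : U -> [k].
        return hash((self.salts[j][0], key)) % self.k

    def _g(self, j, key):
        # Salted built-in hash as a stand-in for the 2-wise independent sign g_j : U -> {+1, -1}.
        return 1 if hash((self.salts[j][1], key)) & 1 else -1

    def update(self, key, delta):
        # General turnstile: delta may be negative.
        for j in range(self.t):
            self.rows[j][self._h(j, key)] += self._g(j, key) * delta

    def estimate(self, key):
        # Median (here: upper median) of the signed per-row estimates g_j(i) * A_j[h_j(i)].
        ests = sorted(self._g(j, key) * self.rows[j][self._h(j, key)] for j in range(self.t))
        return ests[len(ests) // 2]

# Example usage.
cs = CountSketch(eps=0.1, delta=0.05)
cs.update("a", 5)
cs.update("b", -2)
print(cs.estimate("a"), cs.estimate("b"))  # approx. 5 and -2 w. high probability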
3 Applications
• Let's consider a specific application of randomized algorithms, specifically Minimum Cut.
– Given an undirected and unweighted graph:
  G ≡ (V, E), with vertex set V and edge set E consisting of unordered pairs {i, j} with i, j ∈ V, i ≠ j, (85)
the problem of determining a Minimum cut is defined as that of partitioning the nodes/vertices of the graph
into two disjoint sets, s.t. the number of edges between the two sets is as small as possible, e.g.:
  [Figure: an example graph on vertices 1-8, shown before and after partitioning into the two vertex sets realizing a minimum cut.]
– Also quickly define the problem of determining the Minimum s-t cut as the problem of partitioning the
nodes/vertices of the graph into two disjoint sets, s.t. the number of edges between the two sets is as small
as possible, but where it is predetermined that node s has to be in the one set and node t in the other, e.g.:
  [Figure: the same example graph with designated vertices s and t, shown with a minimum s-t cut separating them.]
• There exist multiple ways to deterministically compute the (global) Minimum cut, e.g. by fixing a vertex s and computing a Minimum s-t cut (via max-flow) for each of the other n − 1 choices of t, keeping the smallest.
• A randomized approach is the alg. known as Karger’s (contraction) algorithm:
– Let's start by defining a contraction of a graph. A Contraction of 2 nodes a, b in a graph G is simply the process
of merging them into a 'super node' ab, creating a new graph G′:
  [Figure: before and after contracting a and b into the supernode ab; the edges a-4 and b-4 become two parallel edges ab-4.]
s.t. there are now 2 parallel edges from 'ab' to 4 (both the one from a and the one from b).
– We might repeat this process until there are only two nodes/supernodes left in the resulting graph. Now, obviously,
depending on the sequence of contractions performed, the resulting graph might have a different number of
edges between the 2 supernodes.
– As such, the naive implementation of Karger's alg. becomes (see the sketch below):
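A minimal Python sketch of one such naive O(n²)-style implementation (my own; the notes' pseudocode box is not reproduced here):

import random

def karger_min_cut(edges, n, rng=random):
    """One run of the naive contraction algorithm on a multigraph.

    edges: list of (u, v) pairs over vertices 0..n-1 (parallel edges allowed).
    Returns the number of edges crossing the 2-supernode cut this run ends with."""
    label = list(range(n))   # label[v] = supernode currently containing v
    live = list(edges)       # edges whose endpoints lie in different supernodes
    supernodes = n
    while supernodes > 2 and live:
        u, v = rng.choice(live)          # uniformly random remaining edge
        a, b = label[u], label[v]
        for w in range(n):               # contract: merge supernode b into a (O(n))
            if label[w] == b:
                label[w] = a
        live = [(x, y) for (x, y) in live if label[x] != label[y]]
        supernodes -= 1
    return len(live)

def repeated_karger(edges, n, runs, seed=0):
    # Repeating and keeping the smallest cut found boosts the success prob., as analysed below.
    rng = random.Random(seed)
    return min(karger_min_cut(edges, n, rng) for _ in range(runs))

# Example: two triangles joined by a single edge (min cut = 1).
E = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(repeated_karger(E, n=6, runs=50))  # 1 w. high probability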
– By the assumption that it takes O(n) time to update the node-edge information per contraction, and given the fact
that we will perform n − 2 contractions, the algorithm has running time O(n · (n − 2)) = O(n²).
– Let's analyze the algorithm by considering the probability of it determining the (global) minimum cut → we
consider the absolute success prob., and no approximation ratio.
– Let's define the minimum cut of a graph as the set C of edges involved in the cut → the size of the cut is then
|C|.
– Begin by making the observation that if none of the edges in C are contracted during the n − 2 (where n = |V|) contractions, then C
has survived and the edges connecting the two supernodes in the resulting graph correspond exactly to the min. cut C.
– Define Ei as the event that no edge of C is picked in the i-th contraction, and Si ≡ E1 ∩ E2 ∩ ... ∩ Ei.
As such, P[Sn−2] corresponds to having contracted none of the edges in the (global) min. cut C at the end of
the algorithm, and is given as:
  P[Sn−2] = P[E1] · P[E2 | E1] · P[E3 | E1 ∩ E2] · ... · P[En−2 | E1 ∩ E2 ∩ ... ∩ En−3] (87)
  = ∏_{i=1}^{n−2} P[Ei | ∩_{j=1}^{i−1} Ej], (Just the chain rule for conditional probabilities.) (88)
∗ Furthermore, at (just before) the i-th contraction, the graph has n − i + 1 nodes/vertices, and at every step
in the sequence of contractions, every remaining node v has deg(v) ≥ |C| (else the edges around v would form a smaller cut and C wouldn't be the global min.
cut). Therefore:
  # Edges at i-th contraction ≥ (n − i + 1)|C|/2 (90)
and we might (upper) bound the probability of picking some e ∈ C at the i-th contraction (without having picked
any in advance):
  P[¬Ei | ∩_{j=1}^{i−1} Ej] ≤ (# Edges in min. cut)/(# Edges at i-th contraction) ≤ |C| / ( (n − i + 1)|C|/2 ) = 2/(n − i + 1) (91)
– As such, the probability of not having picked any e ∈ C at the i-th contraction is bounded as:
  P[Ei | ∩_{j=1}^{i−1} Ej] ≥ 1 − 2/(n − i + 1) (92)
resulting in an algorithm that takes O(n²) running time, but has a polynomially decreasing success probability:
  P[Sn−2] = ∏_{i=1}^{n−2} P[Ei | ∩_{j=1}^{i−1} Ej] ≥ ∏_{i=1}^{n−2} ( 1 − 2/(n − i + 1) )
  = (1 − 2/n)(1 − 2/(n−1)) ... (1 − 2/3) = (n−2)/n · (n−3)/(n−1) · (n−4)/(n−2) · ... · 1/3 = 2/(n(n − 1)) = (n choose 2)^(−1) (93)
– However, as so often before, what we can do is simply repeat the algorithm k times (keeping the smallest cut found) to improve the success probability.
Specifically, say that we require the failure prob. (the prob. that none of the k runs finds C) to be:
  P[all k runs miss C] ≤ 1/n (94)
we can solve for k, and find that:
  ( 1 − 2/(n(n−1)) )^k ≤ 1/n ⇔ k ≥ ln(1/n)/ln( 1 − 2/(n(n−1)) ) = −ln(n)/ln( 1 − 2/(n(n−1)) ) (95)
and then, utilizing that ln(·) is monotonically increasing, s.t. ∀ a < b: ln(a) < ln(b), together w. the fact that
∀ x: e^(−x) ≥ 1 − x (obvious from the Maclaurin expansion of e^x), one sees that it suffices to take k = (n(n−1)/2) · ln(n) = O(n² log n) repetitions, giving a total running time of O(n⁴ log n).
– A further observation is that the early contractions are the 'safe' ones: if the contractions are stopped while n/√2 (super)nodes still remain, the min. cut survives w. prob. at least:
  P[S_{n − n/√2}] ≥ 1/2 (100)
– The idea is then that, if one only performs the first n − l contractions in each run of Karger's algorithm, the prob.
of a specific min. cut surviving is bigger. Based on this, the Fast Cut algorithm is devised as sketched below:
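Since the pseudocode box is not reproduced here, the following Python sketch shows one plausible reading of Fast Cut (the Karger-Stein scheme): contract down to roughly n/√2 supernodes twice, recurse on each contracted copy, and return the smaller answer. The contract_to helper and the small base case are my own illustrative choices:

import math
import random

def contract_to(edges, n, target, rng):
    """Contract uniformly random edges until only `target` supernodes remain.
    Returns (edges of the contracted multigraph, number of supernodes), relabelled 0..m-1."""
    label = list(range(n))
    live = list(edges)
    supernodes = n
    while supernodes > target and live:
        u, v = rng.choice(live)
        a, b = label[u], label[v]
        for w in range(n):
            if label[w] == b:
                label[w] = a
        live = [(x, y) for (x, y) in live if label[x] != label[y]]
        supernodes -= 1
    names = {old: new for new, old in enumerate(sorted(set(label)))}
    return [(names[label[x]], names[label[y]]) for (x, y) in live], supernodes

def fast_cut(edges, n, rng=random):
    """Recursive Fast Cut sketch: contract to ~n/sqrt(2) supernodes twice, recurse on each copy."""
    if not edges:
        return 0
    if n <= 6:
        # Tiny base case: a few full contraction runs (illustrative choice of repetition count).
        return min(len(contract_to(edges, n, 2, rng)[0]) for _ in range(n * n))
    t = math.ceil(1 + n / math.sqrt(2))
    candidates = []
    for _ in range(2):
        e2, n2 = contract_to(edges, n, t, rng)
        candidates.append(fast_cut(e2, n2, rng))
    return min(candidates)

# Example on the two-triangles graph from the previous sketch (min cut = 1).
E = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(fast_cut(E, n=6))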
– The first call takes O(n²) because the algorithm has to perform Karger's contractions on the full n-node graph.
3 Specifically it should be 1/√1.69722 to be accurate.