
princeton univ. F’13 cos 521: Advanced Algorithm Design

Lecture 1: Course Intro and Hashing
Lecturer: Sanjeev Arora    Scribe: Sanjeev

Algorithms are integral to computer science, and every computer scientist (even as an
undergrad) has designed several algorithms. So has many a physicist, electrical engineer,
mathematician, etc. This course is meant to be your one-stop shop that teaches you how
to design a variety of algorithms. The operative word is “variety.” In other words, you will
avoid the blinders that one often sees in domain experts. A Bayesian needs priors on the
data before he can design algorithms; an optimization expert wishes to cast all problems
as mathematical optimization; a systems designer has never seen any problem that cannot
be solved by hashing. (OK, mostly kidding, but the joke does reflect truth to some degree.)
These and more domain-specific ideas make an appearance in our course, but we will learn
not to be wedded to any single approach.
The primary skill you will learn in this course is how to analyse algorithms: prove their
correctness, bound their running time, and establish any other relevant properties. Learning
to analyse a variety of algorithms (designed by others) will let you design better algorithms
later in life. I will try to fill the course with beautiful algorithms. Be prepared for frequent
rose-smelling stops, in other words.

1 Difference between grad and undergrad algorithms


Undergrad algorithms is largely about algorithms discovered before 1990; grad algorithms
is a lot about algorithms discovered since 1990. OK, I picked 1990 as an arbitrary cutoff.
Maybe it is 1985, or 1995. What happened in 1990 that caused this change, you may
ask? Nothing. It was no single event but just a gradual shift in the emphasis and goals of
computer science as it became a more mature field.
In the first few decades of computer science, algorithms research was driven by the goal of
understanding how to design basic components of a computer: operating systems, compilers,
networks, etc. Other motivations to study algorithms came out of discrete mathematics,
operations research, graph theory. Thus in undergraduate algorithms you would study
data structures, graph traversal, string matching, parsing, network flows, etc. Starting
around 1990 theoretical computer science broadened its horizons and started looking at new
problems: algorithms for bioinformatics, algorithms and mechanism design for e-commerce,
algorithms to understand big data or big networks. This changed algorithms research and
the change is ongoing. One big change is that it is often unclear what the algorithmic problem
even is. Identifying it is part of the challenge. Thus good modeling is important. This in
turn is shaped by understanding what is possible (given our understanding of computational
complexity) and what is reasonable given the limitations of the type of inputs we are given.


Some examples of this change:

The changing graph. In undergrad algorithms the graph is given and arbitrary (worst-
case). In grad algorithms we are willing to look at where the graph came from (social
network, computer vision etc.) since those properties may be germane to designing a good
algorithm. (This is not a radical idea of course but we will see that formulating good graph
models is not easy. This is why you see a lot of heuristic work in practice, without any
mathematical proofs of correctness.)

Changing data structures: In undergrad algorithms the data structures were simple
and often designed to hold data generated by other algorithms. A stack allows you to hold
vertices during depth-first search traversal of a graph, or instances of a recursive call to a
procedure. A heap is useful for sorting and searching.
But in the newer applications, data often comes from sources we don’t control. Thus it
may be noisy, or inexact, or both. It may be high dimensional. Thus something like heaps
will not work, and we need more advanced data structures.
We will encounter the “curse of dimensionality,” which constrains algorithm design for
high-dimensional data.

Changing notion of input/output: Algorithms in your undergrad course have a simple
input/output model. But increasingly we see a more nuanced interpretation of what the
input is: datastreams (useful in analytics involving routers and webservers), online (sequence
of requests), social network graphs, etc. And there is a corresponding subtlety in settling
on what an appropriate output is, since we have to balance output quality with algorithmic
efficiency. In fact, design of a suitable algorithm often goes hand in hand with understanding
what kind of output is reasonable to hope for.

Type of analysis: In undergrad algorithms the algorithms were usually exact and worked on
all (i.e., worst-case) inputs. In grad algorithms we are willing to relax these requirements.
Advanced Algorithm Design: Hashing
Lectured by Prof. Moses Charikar
Transcribed by Linpeng Tang
Feb 2nd, 2013

1 Preliminaries
In hashing, we want to store a subset S of a large universe U (U can be very
large; say U is the set of all 32-bit integers, so |U| = 2^32), where |S| = m is relatively
small. For each x ∈ U, we want to support 3 operations:
• insert(x). Insert x into S.
• delete(x). Delete x from S.
• query(x). Check whether x ∈ S.

Figure 1: Hash table. The hash function h maps the universe U into a table T with n locations; x is placed in T[h(x)].

A hash table can support all these 3 operations. We design a hash function

    h : U −→ {0, 1, . . . , n − 1}    (1.1)

such that x ∈ U is placed in T[h(x)], where T is a table of size n.
Since |U| ≫ n, multiple elements can be mapped to the same location in
T, and we deal with these collisions by constructing a linked list at each location
in the table.
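As a concrete reference point, here is a minimal sketch of such a chaining hash table in Python (the hash function h is passed in as a parameter, standing in for whichever family we settle on below; the class name is mine, not from the notes):

```python
class ChainedHashTable:
    """Hash table with chaining: T[h(x)] holds a list of all keys mapped there."""

    def __init__(self, n, h):
        self.n = n                        # table size
        self.h = h                        # hash function h: U -> {0, ..., n-1}
        self.T = [[] for _ in range(n)]

    def insert(self, x):
        bucket = self.T[self.h(x)]
        if x not in bucket:               # S is a set, so avoid duplicates
            bucket.append(x)

    def delete(self, x):
        bucket = self.T[self.h(x)]
        if x in bucket:
            bucket.remove(x)

    def query(self, x):
        return x in self.T[self.h(x)]
```

Insert, delete, and query all cost time proportional to the length of the linked list at T[h(x)], which is exactly the quantity analysed below.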
One natural question to ask is: how long is the linked list at each location?
We make two kinds of assumptions:
1. Assume the input is random.

2. Assume the input is arbitrary, but the hash function is random.

Assumption 1 may not be valid for many applications, since the input might
be correlated.
For Assumption 2, we construct a set of hash functions H, and for each
input, we choose a random function h ∈ H and hope that on average we will
achieve good performance.

2 Hash Functions
Say we have a family of hash functions H, and for each h ∈ H, h : U −→ [n],¹
what do we mean by saying these functions are random?
For any x1, x2, . . . , xm ∈ S (xi ≠ xj when i ≠ j) and any a1, a2, . . . , am ∈ [n],
an ideal random H would satisfy:
• Pr_{h∈H}[h(x1) = a1] = 1/n.
• Pr_{h∈H}[h(x1) = a1 ∧ h(x2) = a2] = 1/n². (Pairwise independence.)
• Pr_{h∈H}[h(x1) = a1 ∧ h(x2) = a2 ∧ · · · ∧ h(xk) = ak] = 1/n^k. (k-wise independence.)
• Pr_{h∈H}[h(x1) = a1 ∧ h(x2) = a2 ∧ · · · ∧ h(xm) = am] = 1/n^m. (Full independence; here
take m = |U|.) In this case there are n^m possible h (we store h(x) for each x ∈ U), so we
need m log n bits to represent each hash function. Since m = |U| is usually very large, this
is not practical: for |U| = 2^32 and n = 2^10, that is 2^32 · 10 bits ≈ 5 GB for a single hash
function (see the sketch below).
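To make the last point concrete, a fully independent hash function is nothing more than a table of |U| independent random values, which is exactly what makes it impractical to store; a small illustrative sketch (the function name is mine, not from the notes):

```python
import random

def fully_random_hash(universe_size, n):
    """A truly random function U -> [n]: one independent value per element of U.
    Storing it takes |U| * log2(n) bits, which is hopeless for large universes."""
    table = [random.randrange(n) for _ in range(universe_size)]
    return lambda x: table[x]

# Fine for a toy universe, but for |U| = 2**32 the table alone is about 5 GB.
h = fully_random_hash(universe_size=1000, n=16)
```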

For any x ∈ S, let Lx be the length of the linked list containing x; then Lx is just
the number of elements with the same hash value as x. For each y ≠ x, define the indicator
random variable

    Iy = 1 if h(y) = h(x), and 0 otherwise.    (2.1)

So Lx = 1 + Σ_{y∈S, y≠x} Iy, and

    E[Lx] = 1 + Σ_{y≠x} E[Iy] = 1 + (m − 1)/n.    (2.2)

Note that we don’t need full independence to prove this property; pairwise
independence would actually suffice, since each E[Iy] = Pr[h(y) = h(x)] involves only two elements.
¹ We use [n] to denote the set {0, 1, . . . , n − 1}.

3 2-Universal Hash Families
Definition 3.1 (Carter–Wegman). A family H of hash functions is 2-universal if
for any x ≠ y ∈ U,

    Pr_{h∈H}[h(x) = h(y)] ≤ 1/n.    (3.1)

Note that this property is even weaker than pairwise (2-wise) independence.
We can design 2-universal hash families in the following way. Choose a prime
p ∈ {|U|, . . . , 2|U|} (one exists by Bertrand’s postulate), and let

    f_{a,b}(x) = ax + b mod p    (a, b ∈ [p], a ≠ 0)    (3.2)

and let

    h_{a,b}(x) = f_{a,b}(x) mod n.    (3.3)

Lemma 3.2. For any x1 ≠ x2 and s ≠ t, the system

    ax1 + b ≡ s (mod p)    (3.4)
    ax2 + b ≡ t (mod p)    (3.5)

has exactly one solution (a, b).

Since [p] constitutes a finite field, we have a = (x1 − x2)^{−1}(s − t) and
b = s − ax1 (and a ≠ 0 because s ≠ t). Since there are p(p − 1) different hash functions in H,

    Pr_{h∈H}[f_{a,b}(x1) = s ∧ f_{a,b}(x2) = t] = 1/(p(p − 1)).    (3.6)

Claim 3.3. H = {h_{a,b} : a, b ∈ [p] ∧ a ≠ 0} is 2-universal.

Proof. For any x1 ≠ x2,

    Pr[h_{a,b}(x1) = h_{a,b}(x2)]    (3.7)
      = Σ_{s,t∈[p], s≠t} δ(s ≡ t mod n) · Pr[f_{a,b}(x1) = s ∧ f_{a,b}(x2) = t]    (3.8)
      = (1/(p(p − 1))) Σ_{s,t∈[p], s≠t} δ(s ≡ t mod n)    (3.9)
      ≤ (1/(p(p − 1))) · p(p − 1)/n    (3.10)
      = 1/n,    (3.11)

where δ(·) is the indicator function (1 if the condition holds, 0 otherwise). Inequality (3.10)
follows because for each s ∈ [p] there are at most (p − 1)/n different t such that s ≠ t and
s ≡ t (mod n).
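A minimal sketch of this construction in Python; the helper names (is_prime, random_2_universal) are my own, and the trial-division primality test is only for illustration:

```python
import random

def is_prime(q):
    """Naive trial-division primality test (illustration only)."""
    if q < 2:
        return False
    d = 2
    while d * d <= q:
        if q % d == 0:
            return False
        d += 1
    return True

def random_2_universal(universe_size, n):
    """Sample h_{a,b}(x) = ((a*x + b) mod p) mod n with a, b in [p], a != 0."""
    p = universe_size
    while not is_prime(p):          # a prime in [|U|, 2|U|] exists by Bertrand's postulate
        p += 1
    a = random.randrange(1, p)      # a != 0
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % n

# Example: hash 32-bit integers into a table of size 1000.
h = random_2_universal(2**32, 1000)
print(h(12345), h(67890))
```

Each sampled h can be stored with just the pair (a, b), i.e. O(log |U|) bits, in contrast to the m log n bits needed for a fully independent function.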

Can we design a collision-free hash table then? Say we have m elements,
and the hash table is of size n. Since for any x1 ≠ x2 we have Pr_h[h(x1) = h(x2)] ≤ 1/n,
the expected total number of collisions is just

    E[ Σ_{x1≠x2} 1{h(x1) = h(x2)} ] = Σ_{x1≠x2} Pr[h(x1) = h(x2)] ≤ (m choose 2) · (1/n),    (3.12)

where the sums range over unordered pairs of distinct elements of S. If we pick n ≥ m², then

    E[number of collisions] ≤ 1/2,    (3.13)

and so, by Markov’s inequality,

    Pr_{h∈H}[∃ a collision] ≤ 1/2.    (3.14)

So if the hash table is large enough (n ≥ m²), we can easily find a collision-free hash
function: resample h until there is no collision, and each attempt succeeds with probability
at least 1/2. But in reality, such a large table is often unrealistic.
We may use a two-layer hash table to avoid this problem.

Figure 2: Two-layer hash table. Location i of the first-level table (with locations 0, . . . , n − 1) holds si elements and points to a second-level table with s_i² locations.

Specifically, let si denote the number of elements hashed to location i. If we
construct at location i a second-layer table of size s_i², then by the argument above we can
easily find a collision-free hash function for those si elements. Thus the total size of the
second-layer hash tables is Σ_{i=0}^{n−1} s_i².
Note that Σ_{i=0}^{n−1} si(si − 1) counts ordered pairs of colliding elements, i.e. twice the
number of collisions from Equation (3.12), so for n ≥ m,

    E[Σ_i s_i²] = E[Σ_i si(si − 1)] + E[Σ_i si] ≤ m(m − 1)/n + m ≤ 2m.    (3.15)
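A sketch of how the two-layer table could be built, reusing random_2_universal from the earlier sketch; the retry loops rely on Markov's inequality (each attempt succeeds with probability at least 1/2), and the 4m threshold for the top level is one reasonable choice given E[Σ s_i²] ≤ 2m (function names are mine):

```python
def build_two_layer_table(S, n, universe_size):
    """Static two-layer ("perfect") hash table for a fixed set S of size m."""
    m = len(S)
    # Top level: resample h until the total second-level space is at most 4m.
    while True:
        h = random_2_universal(universe_size, n)
        buckets = [[] for _ in range(n)]
        for x in S:
            buckets[h(x)].append(x)
        if sum(len(b) ** 2 for b in buckets) <= 4 * m:
            break
    # Second level: hash the s_i elements at location i into s_i^2 slots,
    # resampling until collision-free (the "table of size m^2" argument above).
    second = []
    for b in buckets:
        size = max(len(b) ** 2, 1)
        while True:
            g = random_2_universal(universe_size, size)
            if len({g(x) for x in b}) == len(b):   # no collisions within this bucket
                slots = [None] * size
                for x in b:
                    slots[g(x)] = x
                break
        second.append((g, slots))
    return h, second

def two_layer_query(table, x):
    """Membership query: first-level hash picks the bucket, second-level the slot."""
    h, second = table
    g, slots = second[h(x)]
    return slots[g(x)] == x
```

The total space is O(m) in expectation, and a query costs two hash evaluations plus one table lookup.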

4 Load Balance
In the load balancing problem, we can imagine that we are trying to put balls into
bins. If we have n balls and n bins, and we put each ball into a uniformly random bin,
then for a given i,

    Pr[bin i gets at least k balls] ≤ (n choose k) · (1/n^k) ≤ 1/k!.    (4.1)

By Stirling’s formula,

    k! ∼ √(2πk) (k/e)^k.    (4.2)

If we choose k = O(log n / log log n), we can make 1/k! ≤ 1/n². Then, by a union bound
over the n bins,

    Pr[∃ a bin with ≥ k balls] ≤ n · (1/n²) = 1/n.    (4.3)

So with probability at least 1 − 1/n,²

    max load ≤ O(log n / log log n).    (4.4)

Note that if we look at 2 random bins when a new ball comes in and put
the ball in the bin with fewer balls, we can achieve a maximum load of
O(log log n), which is a huge improvement.

² This can easily be improved to 1 − 1/n^c for any constant c.
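An illustrative simulation (not from the notes) comparing the maximum load of uniform random placement with the two-choice rule just described:

```python
import random

def max_load_one_choice(n):
    """Throw n balls into n bins uniformly at random; return the maximum load."""
    bins = [0] * n
    for _ in range(n):
        bins[random.randrange(n)] += 1
    return max(bins)

def max_load_two_choices(n):
    """For each ball, pick two random bins and place it in the less loaded one."""
    bins = [0] * n
    for _ in range(n):
        i, j = random.randrange(n), random.randrange(n)
        bins[i if bins[i] <= bins[j] else j] += 1
    return max(bins)

if __name__ == "__main__":
    n = 100_000
    print("one choice :", max_load_one_choice(n))    # roughly log n / log log n
    print("two choices:", max_load_two_choices(n))   # roughly log log n
```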
