Parallel Hashing: John Erol Evangelista
Figure 1.3. While allocating storage for the value of every possible key in an
array allows directly indexing into the structure, it is wasteful when the array
is mostly unused (top). A hash table can be used instead, which allocates far
less space than the array (bottom). In this example, each slot holds both a key
and its value. The table is indexed into using a hash function h(k). Because
multiple keys may map to the same location, the key contained in the slot and
the query key are compared on a retrieval to ensure the right value is returned.
Hash Tables
Needs to be adapted on a parallel
environment
Serialization
Memory Accesses are Slow
Many probes may be required
CUDA
stands for Compute Unified Device
Architecture
provides essential functionality for
parallel applications, such as scattered
writes in memory and atomic
operations
CUDA C
high-level GPU programming language
that extends C with extra constructs for
dealing with the hardware.
How it works
Programs that run on the GPU are
called kernels and typically consist of
just a few small functions.
Kernels are executed in parallel by
threads, each performing the same
instructions on different data.
e.g. a program computing the hash
function value of every input key.
Limitation
Copying data to and from the GPU is
very expensive.
Kernels do not have access to the host
system's memory.
Solution: Use data structures that can
be built and used entirely in parallel,
allowing data to stay in the GPU while
it is being processed.
How it works
Threads are grouped into thread blocks
of up to 512 threads, which are assigned
to different streaming multiprocessors
(SM) for execution.
Thread blocks are queued up for the
SMs and fed in as earlier thread blocks
finish.
How it works
Thread blocks can complete execution
before others are even started, so there
is no way to globally synchronize all the
threads without finishing the kernel.
Threads in the same block can locally
synchronize using execution barriers,
guaranteeing that they have all reached
the same point before continuing.
How it works
Multiple thread blocks can be handled
by SMs simultaneously, but there is a
hard limit on the number of threads the
SM can handle.
How it works
Each SM breaks its thread blocks into
groups of 32 consecutive threads called
warps.
SMs manage when each of their warps
will be executed in their SIMD cores,
with each step running the same
instruction in lockstep, even when a
branch occurs.
Types of memory
low-latency shared memory
high-latency global memory
Low latency memory
used as cache for global memory
scratchpad for threads working in the
same thread block
fast but small
partitioned; does not persist between
kernel operations
Global Memory
Abundant and shared but slow
To hide latency, SMs automatically context
switch to other warps while memory
transactions are being performed
reads up to 128-byte segments of memory
with a single transaction
memory requests of threads in a warp are
coalesced together into fewer transactions.
Atomic Operations
performed when race conditions are
difficult or impossible to avoid.
perform a series of actions that cannot
be interrupted.
e.g. incrementing a counter
Fermi architecture
higher compute capabilities, more
functionality
efficient atomic operations, cached
memory hierarchy to further reduce
latency when accessing global
memory.
Hashing on a GPU
Open Addressing
While they can be very fast for both
construction and retrieval on a GPU,
problems arise when trying to make a
compact table: in the worst case, the
whole table would have to be
traversed to terminate a query.
Hashing on a GPU
Chaining
number of probes increases greatly as
the number of slots shrinks.
linked lists are horribly inefficient on a
GPU
Hashing on a GPU
Collision-free hashing
a larger space gives a constant
probability of no collision
increased construction time, and
inherently sequential in some
implementations
Hashing on a GPU
Multiple-choice Hashing
Choose the one that has the lowest
occupancy
Cuckoo Hashing
Variation of open addressing that limits
the slots an item can fall into
uses multiple hash functions
Performance Metrics
Construction time
Retrieval efficiency
Memory usage
Open Addressing
Race conditions may occur (multiple
threads attempting to insert an item
into the same location simultaneously)
Open Addressing
Chapter 3
Open addressing
In this chapter, we explore a parallelized implementation of open addressing hash
tables. These hash tables consist of S_T >= N slots, where each slot of the table
can store one of the N items from the input. In order to insert an item with key
k, a series of probes is performed to find an empty slot, beginning with the slot
h(k); the item is immediately inserted into the earliest possible empty slot in the
series (Figure 3.1).
These hash tables are commonly used in sequential computing, but they face
some issues in highly parallel GPU environments. Serial implementations, for ex-
ample, have no chance of losing any items during insertion: nothing can be inserted
into the slot between checking the slot and actually writing the item in. However,
race conditions like this exist in parallel implementations because multiple threads
may be attempting to insert items into the same locations simultaneously.
Figure 3.1. Examples of linear probing (left) and quadratic probing (right).
Open Addressing
The parallel construction assigns each
input item to a thread, then has each
thread simultaneously probe the hash
table for empty slots
force serialization of access to the table
Parameters
Number of slots: S_T >= N, where S_T is
the number of slots and N is the number
of items in the input. Default: S_T = 1.25N.
Probe Sequence
Probing scheme      Hash function
Linear probing      h(k) = g(k) + iteration
Quadratic probing   h(k) = g(k) + c_0 * iteration + c_1 * iteration^2
Double hashing      h(k) = g(k) + jump(k) * iteration
Table 3.1. Open addressing hashing schemes
The probe sequence determines the order in which the slots are examined when
trying to either insert or nd an item; the three common ones are shown in Ta-
ble 3.1.
Linear probing advances along the slots one at a time. While linear probing
hash tables can take advantage of caches because neighboring slots are vis-
ited, the problem is that items can cluster around a particular slot. This
can cause extremely long probe sequences until an empty slot is found, which
is exacerbated when the hash table becomes more and more full. However,
it is guaranteed to visit every slot in the table.
Quadratic probing mitigates the clustering issue by using a sequence gener-
ated by a quadratic function, which takes longer and longer leaps after each
probe. Typically, c_0 = 0 and c_1 = 1 for the hash function. There is still an
issue when a large number of items hash into the same slot, however: since
all of these items use the exact same probe sequence, they will repeatedly
collide.
Double hashing generates probe sequences that are tailored for each key.
To do this, it uses a second hash function to determine the jump taken on
collision, making it unlikely for items colliding to be taking the same sequence
of jumps.
Both quadratic probing and double hashing are able to find empty slots much
more easily than linear probing, as they can jump out of tight clusters in the
table.
Parameters
Maximum allowed length of probe
sequence, used to terminate a probe
sequence that is taking too much time:
min(N, 10000).
Hash Function
Perfect hash function: benefits are
minimal, since the hash tables can be
constructed in a way that effectively
limits the number of probes required to
find an item to just one or two.
Simple randomized hash functions
work well in practice.
Hash Function
g(k) = ((f(a, k) + b) mod p) mod S_T
where a and b are randomly generated
constants, p is a prime number, and S_T
is the number of slots available in the
hash table
Implementation
Algorithm 3.1 Process for creating an open addressing hash table.
1: allocate enough memory for table[ ], which will contain S_T 64-bit slots
2: repeat
3:   fill each slot with SLOT_EMPTY
4:   generate a new hash function for the current attempt
5:   for all key-value pairs (k, v) in the input do
6:     repeat
7:       atomically check-and-set table[location]
8:       advance location to next location in probe sequence
9:     until an empty slot is found or max probes hit
10:  end for
11: until hash table is built
atomic operations instead, with extra steps for moving the values into the right
location after their corresponding keys have been inserted.
Threads complete their work when they successfully place their item, but a
thread block will not complete until all of its threads are done. Thus we choose
a relatively small thread block size of 64 threads, which minimizes the number of
threads within a block kept alive by a single thread's long probe sequence.
Double hashing requires computing a second hash function, jump(k), whose
value is stored to prevent having to recompute it after each collision. Care must
be taken so that the jump distance is greater than 0 to prevent an item from
repeatedly failing to insert itself into the same occupied slot. We use the simple
jump(k) = 1 + (k mod jump_prime), where jump_prime = 41. Lowering this
prime forces the double hashing scheme to take smaller jumps, which speeds up
retrievals since all locations are more likely to be cached, while using larger primes
decreases the construction time instead, since the scheme can jump out of crowded
areas more quickly. The function worked well for larger hash tables, but it did
fail many times for hash tables containing fewer than 5K elements, which require
smaller jumps.
Listing 3.1. Parallel insertion of items into an open addressing table.
__device__ bool insert_entry(const unsigned key,
                             const unsigned value,
                             const unsigned table_size,
                             Entry *table) {
    // Manage the key and its value as a single 64-bit entry.
    Entry entry = ((Entry)key << 32) + value;

    // Figure out where the item needs to be hashed into.
    unsigned index = hash_function(key);
    unsigned double_hash_jump = jump_function(key) + 1;

    // Keep trying to insert the entry into the hash table
    // until an empty slot is found.
    Entry old_entry;
    for (unsigned attempt = 1; attempt <= kMaxProbes; ++attempt) {
        // Move the index so that it points somewhere within the table.
        index %= table_size;

        // Atomically check the slot and insert the key if empty.
        old_entry = atomicCAS(table + index, SLOT_EMPTY, entry);

        // If the slot was empty, the item was inserted safely.
        if (old_entry == SLOT_EMPTY) return true;

        // Move the insertion index.
        if (method == LINEAR)         index += 1;
        else if (method == QUADRATIC) index += attempt * attempt;
        else                          index += attempt * double_hash_jump;
    }

    return false;
}
Parallel Retrieval
Follows same search pattern as
construction
Construction Rates
[Figure: build and query rates, in millions of pairs per second, versus input size (log scale, 512 to 16777216) for linear probing, quadratic probing, and double hashing.]
Figure 3.2. Effect of input size on construction and retrieval rates for tables con-
taining 1.25N slots on both the GTX 280 (top) and 470 (bottom).
The construction rates using all three probing methods become more or less flat,
meaning that the time required to construct the table increases linearly with the
input size. Performance for all three methods on the GTX 470 is roughly double
that of the GTX 280; one likely cause is the speed boost atomic operations received
on Fermi cards. Linear probing does consistently worse than both quadratic
probing and double hashing, reflecting the problems linear probing encounters
when trying to escape crowded areas of the table. Double hashing has a slight
edge over quadratic probing because it can jump away more readily.
Retrieval rates measure how quickly the hash table can be queried for all of
the input items in parallel, with each query assigned to a different thread. We
see similar trends here, which is expected because retrievals mimic the insertion
process without any slow atomic operations. All three methods get a performance
boost, with quadratic probing and double hashing getting a more than 2x boost
on the GTX 470. Although linear probing still lags behind the other methods,
it gets an even bigger 4x performance boost. The sharp decline in retrieval rates
for the GTX 280 for larger input sizes does not appear when using two separate
32-bit arrays to store the keys and values, but storing the data this way only hurt
performance on the GTX 470.
Memory Usage
[Figure: build and query rates, in millions of pairs per second, versus table size (1.0N to 4.0N) for linear probing, quadratic probing, and double hashing.]
Figure 3.3. Effect of the table size on construction and retrieval rates for tables
containing 10 million items.
Quadratic probing has an obvious advantage over double hashing on the GTX
470, which is a direct result of the jump function chosen for double hashing. As
mentioned earlier, allowing larger jumps decreases construction times because the
average number of probes required to insert an item decreases. Conversely, decreas-
ing the jump size allows taking advantage of the cache, speeding up the retrievals.
Our GTX 280 results corroborate this: double hashing consistently performed
slightly better than quadratic probing even for larger jumps, suggesting that the
cache is able to reduce the memory traffic slightly.
Figure 3.3 shows the effect of modifying the table size while keeping the input
size fixed at 10 million items, effectively changing the load of the hash table and
Limitations
Performance drops significantly for
compact tables
High variability in probe sequence
length
Removing items from the table.
Sources
Alcantara, D., Efficient Hash Tables on
the GPU.