Paper 9
Abstract: One disadvantage of bitstate hashing, the traditional probabilistic state space analysis method, is the possibly large amount of memory wasted to achieve a high probability of no address collision in hashing. We look at the trade-off between hashing speed and the memory assigned for hashing. This paper presents a method using multiple hash tables and special hash functions, and shows relevant results not just in increasing the probability of no address collision but also in reducing the memory for hashing.

Keywords: State space analysis, probabilistic method, bitstate hashing, address collision, Petri nets

I. INTRODUCTION

Among all approaches to the problem of determining the membership of an element x in a set K, i.e. the dictionary problem, hashing seems to be one of the most efficient.

If the set of elements in K is static, then it is possible to compute an injective function h(x) that determines membership in one probe. A function capable of this is called a perfect hash function [1].

This paper discusses one special problem: the case when the set K is not static and the number of elements whose membership can be determined is limited by the memory required to represent all the elements. It is a more specialized problem related to the state space (SSp) and its partial analysis using probabilistic methods.

Holzmann’s bitstate hashing (BSH) is the traditional SSp partial analysis method [2]. A non-injective hash function generates a hash address with a uniform distribution of collisions between addresses (2 or more elements with the same hash address), and stores the address in a hash table by the use of one bit value. Membership and existence of an element are determined with only that bit.

Despite all the benefits of BSH, it is hampered by one flaw: the critical assumption of a uniform distribution of the keys used as the parameter of the hash function. Keys are often skewed, not uniformly distributed [3-4]. Furthermore, since their number is large by nature, the length m of the BSH table needs to be much larger than the number n of keys in order to have a high probability of no address collision, which makes the choice of the table length very difficult.

The length of the hash table is frequently increased to reduce address collisions. Furthermore, hashing combined with relocation or grouping of data in complementary data structures to resolve collisions, called chaining, is now a common practice despite its limited efficiency [5] and the unknown length of the structures.

Multiple hash tables, seen as the partition of one hash table and not as complementary data structures or an additional strategy to resolve collisions, are the topic of our research. Their usage is not new; for example, in [6] they are used to reduce the overall contention probability practically without much overhead, for the problem of partitioning a given data set and hashing the parts into separate regions. More recently, shared hash tables in parallel model checking were investigated in [7], showing that communication overhead plays a less important role in the scalability of shared-memory implementations. And [8] explores the benefits of multi-core model checkers with concurrent hash tables as shared state storage.

This paper shows a method for partitioning the hash table into multiple hash tables and uses special hash functions for each table in order to eliminate address collisions. We show that when using our special hash function with non-static sets of keys, our method can have hashing in one probe and a higher probability of no address collision compared with the traditional method. Furthermore, we present how our method is useful for reducing the memory required for hashing.

First, instead of using one large static hash table, memory partitioning is carried out off-line for the design of multiple hash tables, where the sizes of the tables follow a normal probability distribution.

Second, the set of keys is partitioned according to a sum-of-digits function and keys are assigned to the multiple hash tables using that function. A special hash function is associated with each table and used to hash the corresponding keys. We look at injective functions through congruence relations for their design. In this paper we show that, when the number of slots selected for a traditional hash table is not large enough to give a high probability of no collisions, the same number of slots distributed among multiple hash tables may already be enough to confirm membership in one probe with zero address collisions.
Finally, aware that skewed sets of keys often exist, our method is memory efficient because it reduces the required memory space by creating in real time only the hash tables that are needed, when they are needed, improving the usability of the memory by hashing more keys in the reserved space.

This paper presents relevant results for the BSH method using our approach for the partial analysis of the SSp. Section II explains BSH and collisions. Our proposal is in Section III. Section IV shows the analysis of our approach. The evaluation of the new method is presented in Section V. At the end we give some conclusions.

The reader should note that the probability of no address collision at a fixed consumption of memory is used for evaluating our approach. Memory access speed and computability are not reported in this paper.

II. BITSTATE HASHING AND COLLISIONS

Hashing is a scatter storage technique of records distributed over a prescribed range of addresses [2]. It uses a hash function h mapping a set K of keys, a subset of the natural numbers, to m hash addresses.

$$h : K \rightarrow \{0, 1, \ldots, m-1\} \qquad (1)$$

Hash functions need to be efficiently computable and distribute the hash addresses evenly in order to avoid a large number of collisions. The traditional hash function to generate uniformly distributed addresses is by the arithmetic operation of division, where m is the number of slots in the table, k divides m and the residue is the respective address h(k) [3-4].

Perfect hash functions are those that allow us to find any key in the hash table in just one probe. If the function allows us to store a set of records in a table of size m equal to the number of keys, the function is called a minimal perfect hash function because it uses the minimum amount of memory [1].

BSH does not perform explicit storing of each record. Instead it uses a single bit to represent the existence of a record, where the value 1 indicates that the record exists and 0 that it does not. Its strength is the ability to maximize coverage in the face of limited memory, not the ability to provide complete SSp coverage. Its weakness is the possibly large portion of memory reserved for a hash table that might end up not being used.

The method is categorized as probabilistic because of the uncertainty of omissions and collisions of records. To achieve a high probability of no collision $p_{nc} = e^{-\gamma}$, the ratio $\gamma = n^2/2m$ needs to be close to zero [2]. For example, allowing a probability of no collision of 80% for the exploration of a SSp with $n = 10^{12}$ states we need $2.2 \times 10^{24}$ bits, and for 99% we need $5.0 \times 10^{25}$ bits.

It appears illogical to reserve a large space of memory for accommodating, with a non-injective function, a number of states which is not only smaller than the memory space but also smaller than the size of the SSp itself. But BSH is used because even though the number of states n is fixed, many times we cannot foresee the range of the set of keys. Another reason is that the size of the set is not static.

For ease of explanation, let us use a small example. Imagine a SSp with size n=5, and we design a table with m=20. If two keys are 3 and 6, then there is one collision in the address 2, since dividing 20 by 3 and by 6 leaves the residue 2 in both cases. A simpler injective function like h(k)=k could produce zero collisions if the range of keys is from 1 to 20, but if the next key is 21, then there is no more space in the table.

BSH is useful when a complete analysis of the whole SSp is infeasible, meaning its exhaustive exploration is impossible. Then, assuming a randomly driven exploration is conducted, the addresses generated with the non-injective function for the presumably uniform distribution of keys, without regard for their range, are evenly distributed.

It is widely accepted that keys are often not uniformly distributed, but it is easier to assume uniformity for the hash function than to deal with non-uniform keys [2]. This is the main argument for presenting our proposal. If the set of keys can be partitioned according to some type of skewness, the hash table can be partitioned as well. Special hash functions for those partitions may produce fewer address collisions. And by individually creating the hash tables in real time, ad hoc, as they are needed, the tables which are never created represent the savings in memory space.

III. STATE SPACE, HASH TABLES AND KEYS

The set of non-negative integers is Z = {0, 1, 2, ...}. Let us define a set K ⊆ Z of unique integer keys. A key is a state descriptor of a SSp. The SSp has n states.

The digits of a key are numbered $\alpha_1, \alpha_2, \ldots$; e.g. for k=47293, $\alpha_1=4$, $\alpha_2=7$, $\alpha_3=2$, $\alpha_4=9$ and $\alpha_5=3$. The length of a key (number of digits) is the function d(k). The sum-of-digits $\alpha_1+\alpha_2+\cdots+\alpha_i$ of a key is the function s(k).

Lack of uniformity in K is frequently mentioned in many manuscripts, but the skewness is not characterized. Skewness of a set of keys can be determined by the distribution of keys along an axis. Fig. 1 shows 3 common types of key skewness, where black lines represent the distributions of the SSp, and the grey lines the distribution of their superset (or finite universe).

Figure 1. Positive (left), negative (right) and bimodal (bottom) skewness.

The skewness treated in this paper is such that the sum-of-digits of every key from the states of the SSp equals one of a few numbers.
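The key functions d(k) and s(k) defined in this section, and the grouping of keys by their sum-of-digits on which the proposal relies, can be sketched in Python as follows. This is only an illustrative sketch, not the implementation used in the paper: the names `key_length`, `sum_of_digits` and `group_by_digit_sum` are hypothetical, and all keys in the usage example other than 47293 are made up.

```python
from collections import defaultdict


def key_length(k: int) -> int:
    """d(k): number of decimal digits of a non-negative integer key."""
    return len(str(k))


def sum_of_digits(k: int) -> int:
    """s(k): sum of the decimal digits of a non-negative integer key."""
    return sum(int(ch) for ch in str(k))


def group_by_digit_sum(keys):
    """Group keys by s(k); a group is created only when a key needs it."""
    groups = defaultdict(list)
    for k in keys:
        groups[sum_of_digits(k)].append(k)
    return dict(groups)


# Hypothetical skewed key set: only three distinct sum-of-digits values occur,
# so only three groups (candidate tables) are ever created.
example = group_by_digit_sum([47293, 39274, 11, 20, 5])
```

In this usage example the keys 47293 and 39274 share the sum-of-digits 25 and fall into the same group, while no group exists for any sum-of-digits value that never occurs, which mirrors the idea of creating tables only as they are needed.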
A. Multiple Hash Tables

BSH uses a logarithm function to determine the size m of a traditional one-bit hash table. BSH with multiple hash tables needs one normal probability density function to determine the sizes of the tables.

The number of multiple tables depends on the set of keys. A one-bit hash table is created only as it is needed during the exploration. The maximal number of tables is the sum-of-digits of the upper-bound key $k_{max}$ shown in (2).

$$\alpha_1 \times 10^{d_{max}-1} + 9 \times 10^{d_{max}-2} + \cdots + 9 \times 10^{0} \qquad (2)$$

Since we do not know the range of the set of keys K, the value of the key $k_{max}$ is declared as follows. This article assumes the reader is familiar with Petri net (PN) theory; an explanation of its semantics is not given.

Let us define the key $k_{max}$ as the maximal value of all keys in a finite universe Ω of keys (K ⊆ Ω); it may not exist in K, meaning the relation $k_{max} \in K$ is not necessarily true. Its length is $d_{max}$ and its sum-of-digits is $s_{max}$. Any state of the SSp can be seen using PN theory as i places ($p_1, p_2, \ldots, p_i$ with $i = 1, 2, \ldots, d_{max}$), all having a token capacity of 9. In this context, a marking represents a key and $k_{max}$ is the marking with each place holding a number of tokens equal to 9, except for $p_1$, which holds the left-most significant digit of $k_{max}$.

Let us assume the variable X represents the number of tokens in a marking. A histogram of X for all possible combinations of markings (the universe of ω keys) appears like a normal probability distribution with $\hat{\mu} = s(k_{max})/2$ and $\hat{\sigma} = s(k_{max})/6$ (i.e. $X \sim N(\hat{\mu}, \hat{\sigma})$ [9, 10]). Only a goodness-of-fit test can confirm it, but this characterization is already referred to in [11, 12].

With our proposed approach, the maximal number of hash tables is $s(k_{max})+1$. The partition of m slots into multiple hash tables is calculated according to the formula in (3). All the tables together have a combined length of approximately m slots, normally distributed.

This special hash function is based on the method for casting-out-nines. Its properties are universally known, therefore no further explanation is given.

IV. ANALYSIS OF THE APPROACH

In this paper we analyzed the number of address collisions in the traditional key to hash table (K2T) and in our key to multiple hash tables (K2MT) for hashing a SSp with a set of keys of specific size n.

In [13] we first analyzed the case when the number of slots m is less than the numerical value of the key $k_{max}$. By using a casting-out-nines transformation of the keys and an inverted-values modulo function as the hash function, we observed a large reduction in the number of address collisions in many skewed sets of keys. This paper focuses on the complementary case, when the number of slots m is greater than the numerical value of the key $k_{max}$, by using only casting-out-nines as the hash function.

We start our analysis from the desired probability of no collision $p_{nc}$ in a K2T at a fixed table size, calculated as $m = -n^2/(2 \times \ln(p_{nc}))$. Let us use the same example as in [2]. Consider a bounded PN with 1000 places and a maximal token capacity of less than 256. The author considers a SSp of $n = 10^6$ states, but the knowledge of n does not relate to its range. However, from the universe of $256^{1000}$ keys we can deduce $k_{max}$, $\alpha_1 = 2$ and $s(k_{max}) = 2 + 9 \times 999 + 9 \times 10^3 + 9 \times 10^3 = 26993$.

Hypothetically speaking about K2MT, if all keys have different sum-of-digits, then n tables are created and each one hashes one key only. If $n < s(k_{max})+1$, then the memory saving comes from the hash tables which are not created. In a more unfortunate case, if $n > s(k_{max})+1$, then there is a chance that all tables are created. This condition describes the logic for the following analysis.

When all the n keys have the same sum-of-digits, the probability of no collision $p_{nc}$ is 100% if for all tables j = 0 to $s(k_{max})$ the relation $n \le m \times N(j; \hat{\mu}, \hat{\sigma})$ holds. More generally, just for the tables associated with keys with sum-of-digits x that satisfy that condition, the probability of no collision $p_{nc}$ is
proposal is inferior only when $n = 10^4$, but for larger SSp, like $n = 10^5$ and $n = 10^6$, our proposal is better, meaning not only a probability of no collision of 100% but also a smaller number of slots in the tables.

For analysis purposes, we may extend the representation of a SSp with the uniform probability distribution $U(0, s(k_{max}))$. For the probability of no address collision to approach 100% with our method, in tables for keys with sum-of-digits x, in general the following relation may hold when the size of the SSp is large:

$$n \times (s(k_{max}) + 1) \times U(x; 0, s(k_{max})) < \frac{-n^2}{2 \ln(p_{nc})} \times N(x; \hat{\mu}, \hat{\sigma}).$$

Figure 2. Probability of no collision in K2T and K2MT.

A. Range of the Addresses Versus Number of Slots

The special hash functions presented here create unique addresses for keys in one table. All the addresses in one table are nonconsecutive, nonlinear and divisible by 9.

Allow us to roughly assign $k_{max}$ as the maximal address in any table, and let the number of slots in all the multiple hash tables be the minimum one that may exist, i.e. $m \times N(x; \hat{\mu}, \hat{\sigma})$ with x = 0 or $s(k_{max})$. When the size of the SSp is proportionally large with respect to the size ω of the finite universe Ω of keys, at high values of $p_{nc}$ the calculation of m may result in a number of slots so large that even $m \times N(x; \hat{\mu}, \hat{\sigma}) > k_{max}$. Then for the keys with the same sum-of-digits assigned to one K2MT table, by using the special hash functions, the range of the addresses is covered by the length of the table.

For a better explanation, let us use different SSp with sizes from n = 115292150460 states to 115292150460684. For two values of $p_{nc}$, 80% and 90%, we calculated the minimum number of slots (Min m). The results are in Fig. 3. The higher the probability of no address collision $p_{nc}$, the faster the growth in the number of necessary slots.

V. EVALUATION

Aware of the benefits of our K2MT arrangement and the special hash functions that achieve no address collision, the element used in this paper to judge practical benefit is the memory used compared with the traditional K2T method.

We used relatively large SSp, with sizes from n = 419430 states to 629145, representing 40 to 60% of the size ω of a universe of keys. The probability of no collision $p_{nc}$ was set to 99%.
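The slot counts used in this evaluation follow from the sizing relation $m = -n^2/(2 \ln(p_{nc}))$ and from the normal partition $m \times N(j; \hat{\mu}, \hat{\sigma})$ of the slots among the tables. A minimal Python sketch of both calculations is given below, under stated assumptions: the function names are hypothetical, $\hat{\sigma}$ is taken to be a standard deviation, the fractional slot counts are not rounded, and the value $s(k_{max}) = 45$ (a universe with $k_{max} = 99999$) is chosen only for illustration and is not the universe used in the paper.

```python
import math


def min_slots(n: float, p_nc: float) -> float:
    """Slots m of a traditional table (K2T): m = -n^2 / (2 ln p_nc)."""
    return -(n ** 2) / (2.0 * math.log(p_nc))


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma) at x; sigma is assumed to be the standard deviation."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def table_sizes(m: float, s_max: int) -> list:
    """Slots m * N(j; mu, sigma) for each table j = 0..s_max."""
    mu, sigma = s_max / 2.0, s_max / 6.0
    return [m * normal_pdf(j, mu, sigma) for j in range(s_max + 1)]


# Sizes quoted in Section II for n = 10^12 states.
m_80 = min_slots(1e12, 0.80)   # about 2.2e24 bits
m_99 = min_slots(1e12, 0.99)   # about 5.0e25 bits

# Hypothetical universe with k_max = 99999, so s(k_max) = 45 and 46 tables.
m_eval = min_slots(419430, 0.99)
sizes = table_sizes(m_eval, 45)
```

Because $\hat{\mu} \pm 3\hat{\sigma}$ spans the whole range of sum-of-digits values, the table sizes add up to slightly less than m, which agrees with the statement that the tables together have a combined length of approximately m slots.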
Normal and uniformly distributed random samples of keys were prepared for the evaluation.

To determine the amount of memory that can be reduced, which in our case is the sum of the number of slots from tables not created by our method, the same sets of keys were hashed to K2T and K2MT tables.

The results for a number of slots m obtained from n and $p_{nc}$ for the set of normally distributed keys are in Fig. 4. The dashed line represents the number of slots necessary for the traditional K2T. The continuous line represents the number of slots in K2MT created to hash the keys. A few tables in K2MT were not created, which means that the amount of necessary space is approximately 20% less than in the traditional K2T.

Thus, the portion of memory used from m by our method was only 80% because not all the tables were necessary. Moreover, while our method shows no address collision for normally distributed keys, the traditional method showed an average of 68% of address collisions.

For the uniformly distributed set of keys, the reduction of memory is only about 19%. The average of address collisions presented by the traditional method was 65%.

Figure 4. Used memory m in K2MT.

CONCLUSIONS

Hashing with multiple tables of normalized size and special hash functions with the property of casting-out-nines presents not only the obvious benefit of zero address collisions, but also a possibly large reduction in the used memory with respect to the traditional hash table, for skewed sets of keys.

The trade-off between hashing speed and reducing the required memory for hashing may have a positive impact in reducing the number of address collisions in the SSp analysis.

Evaluation of the computational overhead will be among the next steps of our research. Although multiple hash tables may be at a disadvantage compared with the traditional hash table, this is not expected to be an important drawback because our method is oriented towards SSp analysis with parallel processing, allowing multiple cores to conduct simultaneous explorations.

REFERENCES
[1] Zbigniew J. Czech, George Havas and Bohdan S. Majewski, “Perfect hashing,” Theoretical Computer Science, vol. 182, issues 1-2, pp. 1-143, 1997.
[2] Matthias Kuntz and Kai Lampka, “Probabilistic methods in state space analysis,” Springer Verlag, LNCS, vol. 2925, pp. 339-383, 2004.
[3] Werner Buchholz, “File organization and addressing,” IBM Systems Journal, vol. 2, issue 2, pp. 86-111, 1963.
[4] D. E. Knuth, The Art of Computer Programming: Sorting and Searching, 2nd ed., vol. 3. Addison-Wesley Longman, 1998.
[5] Nikolas Askitis, “Fast and compact hash tables for integer keys,” Proc. of the 32nd Australasian Computer Science Conference (ACSC 2009), Wellington, New Zealand, January 2009.
[6] Ming-Syan Chen, “Optimal design of multiple hash tables for concurrency control,” IEEE Transactions on Knowledge and Data Engineering, vol. 9, issue 3, pp. 384-390, 1997.
[7] S. Caselli, G. Conte and P. Marenzoni, “Parallel state space exploration for GSPN models,” Applications and Theory of Petri Nets 1995, Springer Verlag, LNCS 935, pp. 181-200, 1995.
[8] Alfons Laarman, Jaco van de Pol and Michael Weber, “Boosting multi-core reachability performance with shared hash tables,” Formal Methods in Computer-Aided Design (FMCAD) 2010, pp. 247-255, October 2010.
[9] Eleazar J. Serrano, “Hypothesis test for unsuccessful partial explorations of state spaces of Petri nets,” Proc. of the World Congress on Engineering 2011 (WCE 2011), London, UK, July 2011.
[10] Eleazar J. Serrano, “Partial exploration of state spaces and hypothesis test for unsuccessful search,” Electrical Engineering and Intelligent Systems, Springer Verlag, LNEE 130, chp. 1, 2012.
[11] Jean Marie Dumont and Alain Thomas, “Gaussian asymptotic properties of the sum-of-digits function,” Journal of Number Theory, no. 62, pp. 19-38, 1997.
[12] Michael Drmota and Johannes Gajdosik, “The distribution of the sum-of-digits function,” Journal Theor. Nombres Bordeaux, vol. 10, pp. 17-32, 1998.
[13] Eleazar J. Serrano, “Multiple hash tables for skewed sets of integer keys,” to appear in Proc. of the 12th International Conference on Applied Informatics and Communications (AIC’12), Istanbul, Turkey, August 2012.