Newsgroups: comp.theory,sci.math,comp.lang.scheme,comp.lang.lisp
Path: cantaloupe.srv.cs.cmu.edu!das-news2.harvard.edu!news2.near.net!news.mathworks.com!yeshua.marcam.com!usc!howland.reston.ans.net!agate!darkstar.UCSC.EDU!news.hal.COM!decwrl!netcomsv!netcom.com!netcom19!vanmeule
From: vanmeule@netcom19.netcom.com (Andre van Meulebrouck)
Subject: hashing functions
Message-ID: <VANMEULE.94Oct25043449@netcom19.netcom.com>
Sender: vanmeule@netcom.com (Andre van Meulebrouck)
Cc: vanmeule@acm.org
Organization: NETCOM On-line services
Date: Tue, 25 Oct 1994 11:34:49 GMT
Lines: 112
Xref: glinda.oz.cs.cmu.edu comp.theory:10929 sci.math:84151 comp.lang.scheme:10610 comp.lang.lisp:15256

			       Purpose
			       -------

I want to solicit comments regarding what the best hashing functions
are and how they should compare against each other. 


			 A Hashing Simulator
                         -------------------

I have a simulator in which I am benchmarking linear, quadratic, and
double hashing against each other and running statistics on them.

(Orthogonal comments: the simulator is entirely coded in Scheme, a
dialect of LISP.  I added comp.lang.scheme and comp.lang.lisp as hash
tables are often coded in LISP and are often used in the
implementation of LISP interpreters/compilers.  If my choice of
newsgroups is not appropriate, *PLEASE* fix headers in responses. ;-)


		       Preliminary Indications
                       -----------------------

So far, it looks like linear is fastest, quad is 2nd, and double is
3rd, but that linear blows up badly as the hash table gets more
crowded (double does the best then, with quad a close second).

Interestingly, they all seem to "peak" at the same configuration
(where "configuration" means the choice of number of buckets
vs. number of elements per bucket)!  For the size I'm doing the most
testing at, that's 240 buckets of 5 elements each, into which I hash
1,000 random integers.

In general, it would seem that linear is nice because it's so easy to
code, and it's so fast, but if you expect the table to get crowded,
quad would be a better bet (much improvement for minimal increase in
complexity).  Double is best for crowded tables, but may not be worth
the overhead and complexity for most uses.

I would have thought that double hashing would be the fastest, but at
the same time it doesn't seem unthinkable that linear would do so
well: hashing has to be *f*a*s*t* (as Knuth notes in his famous
volumes) and double has some overhead (though there may be double
hashing techniques with lighter overhead than the one I am using:
that's part of the feedback I'm soliciting).


		      My Current Hash Functions
                      -------------------------

My linear probe uses a step size relatively prime to the table size
(where "size of the hash table" means "number of buckets").  The step
is picked to be the nearest number relatively prime to some percentage
of the table size; the percentage came from lots of trial and error
tests--roughly it's 10%.
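
For concreteness, here's roughly how such a step could be picked (a
sketch only; the search-outward-from-10% rule is my guess at the
selection described above):

```scheme
;; Pick a probe step for linear hashing: the number nearest to
;; 10% of the table size that is relatively prime to it.
;; (Sketch -- the exact selection rule is a guess.)
(define (nearest-coprime-step table-size)
  (let loop ((delta 0))
    (let ((lo (- (quotient table-size 10) delta))
          (hi (+ (quotient table-size 10) delta)))
      (cond ((and (> lo 1) (= 1 (gcd lo table-size))) lo)
            ((and (< hi table-size) (= 1 (gcd hi table-size))) hi)
            (else (loop (+ delta 1)))))))

;; With a step s coprime to the table size, the probes
;; (home + i*s) mod size visit every bucket exactly once.
(define (linear-probe home step i table-size)
  (modulo (+ home (* i step)) table-size))
```

For a 240-bucket table this picks 23 (24 shares a factor with 240;
23 doesn't).  The coprimality is what guarantees the probe sequence
is a full cycle over the buckets.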

My quad is just a typical quad function.  Nothing special.
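
(For reference, by "typical quad function" I mean the usual formulation
where the i-th probe lands i^2 slots past the home bucket--one common
variant among several:

```scheme
;; Typical quadratic probing: the i-th probe lands i^2 slots
;; past the home bucket.  One common formulation.
(define (quad-probe home i table-size)
  (modulo (+ home (* i i)) table-size))
```

Note that unlike linear probing with a coprime step, quadratic probing
is not guaranteed to visit every bucket unless the table size is
chosen carefully, e.g. prime.)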

My double hashing features a second hash (a modulo keyed off the first
table) into a second table whose size is relatively prime w.r.t. the
size of the (original) hash table.  Every entry in the second table is
a step size, each picked to be relatively prime w.r.t. the original
table size.  The step sizes run from 1 on up, and the size of the
second table is some percentage of the size of the first (again, I
honed what percentages work best by trial and error).
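
A sketch of that arrangement, assuming the second table is just a
vector of precomputed coprime step sizes (the names and the 10% sizing
here are mine, not necessarily what anyone else uses):

```scheme
;; Second table: step sizes 1, 2, ... filtered to those relatively
;; prime to the main table size, with roughly 10% as many entries.
;; (Making its *length* coprime to the main table size as well is
;; left out of this sketch.)
(define (make-step-table table-size)
  (let* ((n (max 1 (quotient table-size 10)))
         (v (make-vector n 1)))
    (let loop ((i 0) (candidate 1))
      (cond ((= i n) v)
            ((= 1 (gcd candidate table-size))
             (vector-set! v i candidate)
             (loop (+ i 1) (+ candidate 1)))
            (else (loop i (+ candidate 1)))))))

;; The key modulo the step-table's size picks the step to use.
(define (double-step key step-table)
  (vector-ref step-table (modulo key (vector-length step-table))))
```

The extra modulo and vector-ref per lookup is exactly the kind of
per-probe overhead that could explain double hashing losing to linear
on uncrowded tables.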

One last note on mechanics: the home bucket is computed first, then
successive probes are made at increasing step counts until no
collision is detected, at which point the element is placed.  (The
home bucket is considered as having a step size of 0.)  There is a cap
on the number of probes attempted before the element is relegated to
"overflow".  Currently this cutoff is a percentage of the size of the
hash table (about 10%, which all my percentages seem to gravitate
to!).
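
That probe loop can be written once, generically, with the probing
strategy passed in (a sketch; the names are mine):

```scheme
;; Generic probe loop: try the home bucket (step 0), then successive
;; probes from PROBE, giving up after MAX-PROBES with 'overflow.
;; OCCUPIED? tests a bucket; PROBE maps (home i) to a bucket index.
(define (find-slot home probe occupied? max-probes)
  (let loop ((i 0))
    (cond ((> i max-probes) 'overflow)
          ((occupied? (probe home i)) (loop (+ i 1)))
          (else (probe home i)))))

;; Example: 7 buckets, step size 2, buckets 3 and 4 occupied.
(find-slot 3
           (lambda (home i) (modulo (+ home (* i 2)) 7))
           (lambda (b) (memv b '(3 4)))
           3)  ; => 5
```

This is also a convenient shape for a simulator, since swapping the
PROBE procedure is all it takes to switch among linear, quadratic, and
double hashing.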


			      Questions
			      ---------

What I wish to solicit is ideas for hash functions (if you think the
ones I chose aren't the best) and any other comments/suggestions you
think might be helpful.

Also, does my conclusion (i.e. that linear is the fastest hash method,
until the table gets pathologically crowded) seem reasonable?


		Feasibility of Radical Hash Functions
		-------------------------------------

I'm especially interested in finding out if there are hash functions
other than linear, quad, double that are worth looking at.  

For instance, what about pseudo-random generators as hash functions?
Would they improve distribution (i.e. alleviate "bunching")?  Would
the overhead of creating pseudo random numbers be prohibitive?
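
For what it's worth, "random probing" in this sense is usually done by
seeding a cheap generator with the key, so the probe sequence is
reproducible without storing anything.  A sketch, using a linear
congruential generator (the constants here are the well-known
Numerical Recipes ones; any decent LCG would do):

```scheme
;; Random probing: seed an LCG with the key so the probe
;; sequence is cheap and reproducible per key.
(define (make-probe-sequence key table-size)
  (let ((state key))
    (lambda ()
      (set! state (modulo (+ (* 1664525 state) 1013904223)
                          4294967296))   ; mod 2^32
      (modulo state table-size))))
```

The catch is that, unlike a coprime linear step, such a sequence isn't
guaranteed to visit every bucket, and there's a multiply per probe.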

I once saw an article, written by someone in research at Apple, in
which (I believe) he claimed that the "perfect" distribution gotten
from a pseudo-random generator didn't give the best hashing
performance (i.e. that you actually *don't* want perfect distribution,
despite its seeming appeal for dispersing things with maximal evenness
through the table), but I can't find the article again (anyone have a
reference?).

What about using the golden ratio as a hash function?
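
(The golden-ratio idea is usually realized as multiplicative
hashing--sometimes called Fibonacci hashing, and discussed by Knuth:
scale the key by the fractional part of phi and keep the fraction.  A
sketch:

```scheme
;; Multiplicative ("Fibonacci") hashing: scale the key by the
;; fractional part of the golden ratio and keep the fraction.
(define phi-frac 0.6180339887498949)  ; (sqrt(5) - 1) / 2

(define (golden-hash key table-size)
  (let ((x (* key phi-frac)))
    (inexact->exact (floor (* table-size (- x (floor x)))))))
```

Note this only picks the *home* bucket--it spreads consecutive keys
nicely, but the collision-resolution question remains.)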

				 ***

If you've read this far--thanks for your time and interest!
-- 
+------------------------------------------------------------------------+
| Andre van Meulebrouck.  For personal data:  finger vanmeule@netcom.com |
+------------------------------------------------------------------------+
