Vbhash
COLLECTIONS
AND HASH TABLES
Tom Niemann
Preface
Hash tables offer a method for quickly storing and accessing data based on a key value.
When you access a Visual Basic collection using a key, a hashing algorithm is used to
determine the location of the associated record. Collections, as implemented, do not
support duplicate keys. For this reason, and for performance, you may wish to code your
own hashing algorithm.
I'll discuss open hashing, where data is stored in a node, and nodes are chained from
entries in a hash table. I'll also examine the effect of hash table size on execution time. This
is followed by a section on hashing algorithms. Then we'll look at different techniques for
representing nodes. Finally, I'll compare the various strategies, examining execution time
and storage requirements.
Source code for all examples may be downloaded from the site listed below. Cormen
[2001] and Knuth [1998] both contain excellent discussions on hashing. Stephens [1998]
is a good reference for hashing and node representation in Visual Basic. This article also
appears in the spring 1999 issue of the Technical Guide to Visual Programming, published
by Fawcette Technical Publications.
THOMAS NIEMANN
Portland, Oregon
Open Hashing
A hash table is simply an array that is addressed via a hash function. For example, in Figure
1, HashTable is an array with 8 elements. Each element is a pointer to a linked list of
numeric data. The hash function for this example simply divides the data key by 8, and
uses the remainder as an index into the table. This yields a number from 0 to 7. Since the
range of indices for HashTable is 0 to 7, we are guaranteed that the index is valid.
Figure 1: A hash table with 8 entries. Each entry points to a linked list of the keys that
hash to that index (key Mod 8).

HashTable(0) -> 16
HashTable(1) -> (empty)
HashTable(2) -> (empty)
HashTable(3) -> 11 -> 27 -> 19
HashTable(4) -> (empty)
HashTable(5) -> (empty)
HashTable(6) -> 22 -> 6
HashTable(7) -> (empty)
To insert a new item in the table, we hash the key to determine which list the item goes
on, and then insert the item at the beginning of the list. For example, to insert 11, we divide
11 by 8 giving a remainder of 3. Thus, 11 goes on the list starting at HashTable(3). To
find a number, we hash the number and chain down the correct list to see if it is in the table.
To delete a number, we find the number and remove the node from the linked list.
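As a concrete illustration, here is a minimal sketch of insert and find for this example. It is
not the download code; it assumes a simple class named Node with public Key and NextNode
fields, and a table size of 8 as in Figure 1.

    ' --- class module Node (hypothetical) ---
    ' Public Key As Long
    ' Public NextNode As Node

    ' --- standard module ---
    Private Const TableSize As Long = 8
    Private HashTable(0 To TableSize - 1) As Node   ' each entry heads a linked list

    Private Function HashKey(ByVal Key As Long) As Long
        HashKey = Key Mod TableSize      ' remainder of division selects a list
    End Function

    Public Sub Insert(ByVal Key As Long)
        Dim p As Node
        Set p = New Node
        p.Key = Key
        Set p.NextNode = HashTable(HashKey(Key))   ' new node goes at the head of the list
        Set HashTable(HashKey(Key)) = p
    End Sub

    Public Function Find(ByVal Key As Long) As Boolean
        Dim p As Node
        Set p = HashTable(HashKey(Key))
        Do While Not p Is Nothing                  ' chain down the list
            If p.Key = Key Then
                Find = True
                Exit Function
            End If
            Set p = p.NextNode
        Loop
        Find = False
    End Function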
Entries in the hash table are dynamically allocated and entered on a linked list
associated with each hash table entry. This technique is known as chaining. If the hash
function is uniform, or equally distributes the data keys among the hash table indices, then
hashing effectively subdivides the list to be searched. Worst-case behavior occurs when all
keys hash to the same index. Then we simply have a single linked list that must be
sequentially searched. Consequently, it is important to choose a good hash function. The
following sections describe several hashing algorithms.
Table Size
Assuming n data items, the hash table size should be large enough to accommodate a
reasonable number of entries. Table 1 shows the maximum time required to search for all
entries in a table containing 10,000 items.
A small table size substantially increases the time required to find a key. A hash table may
be viewed as a collection of linked lists. As the table becomes larger, the number of lists
increases, and the average number of nodes on each list decreases. If the table size is 1,
then the table is really a single linked list of length n. Assuming a perfect hash function, a
table size of 2 has two lists of length n/2. If the table size is 100, then we have 100 lists of
length n/100. This greatly reduces the length of the list to be searched. There is considerable
leeway in the choice of table size.
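The trade-off is easy to quantify: with a uniform hash function, the average list length, and
hence the average search effort, is simply n divided by the table size. A trivial sketch:

    ' Average number of nodes per list, assuming a uniform hash function.
    Public Function AverageChainLength(ByVal n As Long, ByVal TableSize As Long) As Double
        AverageChainLength = n / TableSize
    End Function

    ' For example: AverageChainLength(10000, 1) = 10000
    '              AverageChainLength(10000, 100) = 100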
Hash Functions
In the previous example, we determined a hash value by examining the remainder after
division. In this section we’ll examine several algorithms that compute a hash value.
' 8-bit index
Private Const K As Long = 158
For example, if HashTableSize is 1024 (2^10), then a 16-bit index is sufficient and S would
be assigned a value of 2^(16-10) = 64. Constant N would be 2^10 - 1, or 1023. Thus, we have:
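The listing that originally followed is not reproduced here. A minimal sketch consistent with
the constants just described, assuming an 8-bit key so that the product K * Key fits in 16 bits
(the function name HashMult is not from the original), might be:

    ' Multiplication-style hash: scale the key, drop the low-order bits with an
    ' integer divide, and mask the result to a valid table index.
    Private Const K As Long = 158       ' ~0.618 * 2^8 (golden ratio), as above
    Private Const S As Long = 64        ' 2^(16-10): discards the low 6 bits
    Private Const N As Long = 1023      ' 2^10 - 1: masks to the range 0-1023

    Public Function HashMult(ByVal Key As Long) As Long
        ' Key is assumed to be 0-255 so that K * Key fits in 16 bits
        HashMult = ((K * Key) \ S) And N
    End Function

The next listing hashes a string by simply summing its character values: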
' Additive method: sum the character values of the string
Public Function Hash(ByVal S As String) As Long
    Dim h As Long
    Dim i As Long
    h = 0
    For i = 1 To Len(S)
        h = h + Asc(Mid(S, i, 1))
    Next i
    Hash = h
End Function
Private Rand8(0 To 255) As Byte     ' table of 256 unique 8-bit random numbers

' Exclusive-or method: fold each character into the hash through Rand8
Public Function Hash(ByVal S As String) As Long
    Dim h As Byte
    Dim i As Long
    h = 0
    For i = 1 To Len(S)
        h = Rand8(h Xor Asc(Mid(S, i, 1)))
    Next i
    Hash = h
End Function
Rand8 is a table of 256 unique 8-bit random numbers; the exact ordering is not critical.
The exclusive-or method has its basis in cryptography, and is quite effective
(Pearson [1990]).
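One way to build such a table (a sketch only, not necessarily how the download does it) is to
shuffle the values 0 through 255 once at startup:

    ' Fill Rand8 with a random permutation of 0-255 using a Fisher-Yates shuffle.
    Public Sub InitRand8()
        Dim i As Long, j As Long
        Dim t As Byte
        For i = 0 To 255
            Rand8(i) = i
        Next i
        Randomize
        For i = 255 To 1 Step -1
            j = Int(Rnd * (i + 1))      ' random index in the range 0 to i
            t = Rand8(i)
            Rand8(i) = Rand8(j)
            Rand8(j) = t
        Next i
    End Sub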
' Compute two 8-bit hashes in parallel and combine them into a 16-bit value
Public Function Hash(ByVal S As String) As Long
    Dim h1 As Byte, h2 As Byte
    Dim c As Integer
    Dim i As Long
    If Len(S) = 0 Then
        Hash = 0
        Exit Function
    End If
    h1 = Asc(Mid(S, 1, 1))
    h2 = (h1 + 1) And 255           ' start the second hash one ahead; mask avoids overflow
    For i = 2 To Len(S)
        c = Asc(Mid(S, i, 1))
        h1 = Rand8(h1 Xor c)
        h2 = Rand8(h2 Xor c)
    Next i
    ' one way to combine the two 8-bit hashes into a single 16-bit value
    Hash = CLng(h1) * 256 + h2
End Function
Hashing strings is computationally expensive, as we manipulate each byte in the string. A
more efficient technique utilizes a DLL, written in C, to perform the hash function.
Included in the download is a test program that hashes strings using both C and Visual
Basic. The C version is typically 20 times faster.
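By way of illustration, a C routine exported from a DLL is reached through a Declare
statement; the library name HashLib.dll and the entry point HashString below are
hypothetical, not the names used in the download:

    ' Hypothetical declaration of a C hash routine exported from a DLL.
    Private Declare Function HashString Lib "HashLib.dll" _
        (ByVal S As String, ByVal TableSize As Long) As Long

    ' Usage: index = HashString("some key", 1024)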
Node Representation
If you plan to code your own hashing algorithm, you'll need a way to store data in nodes,
and a method for referencing the nodes. This may be done by storing nodes in objects or
in arrays. I'll use a linked list to illustrate each method.
Objects
References to objects are implemented as pointers in Visual Basic. One implementation
simply defines the data fields of the node in a class, and accesses the fields from a module:
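The listing that originally appeared here is not reproduced. A minimal sketch, assuming a
node class (called clsNode here) with a Value field and a NextNode reference, might be:

    ' --- class module clsNode (hypothetical name) ---
    ' Public Value As Long
    ' Public NextNode As clsNode

    ' --- standard module ---
    Private pObj As clsNode             ' reference to the first node in the list

    Public Sub InsertValue(ByVal v As Long)
        Dim p As clsNode
        Set p = New clsNode             ' allocate a new node
        p.Value = v
        Set p.NextNode = pObj           ' link it in front of the old head
        Set pObj = p                    ' pObj now references the new node
    End Sub

    Public Sub DeleteFirst()
        ' dropping the last reference lets Visual Basic reclaim the node
        If Not pObj Is Nothing Then Set pObj = pObj.NextNode
    End Sub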
In the above code, pObj is internally represented as a pointer to the object. When we add a
new node to the list, an instance of the node class is allocated, and a pointer to the node is
placed in pObj. The expression pObj.Value actually de-references the pointer and accesses
the Value field. To delete the first node, we remove all references to the underlying object.
Arrays
An alternative implementation allocates an array of nodes, and the address of each node is
represented as an index into the array.
' initialization: the list is empty and the first free subscript is 1
hdrArr = 0
nxtArr = 1
Each field of a node is represented as a separate array, and referenced by subscripts instead
of pointers. For a more robust solution, there are several problems to solve. In this example,
we've allowed for 100 nodes, with no error checking. Enhancements could include
dynamically adjusting the array size when nxtArr exceeds the array bounds. Also, no
provisions have been made to free a node for possible re-use. This may be accomplished
by maintaining a list of subscripts referencing free array elements, and providing functions
to allocate and free subscripts. Included in the download is a class designed to manage
node allocation, allowing for dynamic array resizing and node re-use.
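Putting the pieces together, a minimal sketch of this representation (the array names valArr
and lnkArr are assumptions; hdrArr and nxtArr match the fragment above) might be:

    Private Const MaxNodes As Long = 100
    Private valArr(1 To MaxNodes) As Long   ' data field of each node
    Private lnkArr(1 To MaxNodes) As Long   ' subscript of the next node (0 = end of list)
    Private hdrArr As Long                  ' subscript of the first node (0 = empty list)
    Private nxtArr As Long                  ' next unused subscript

    ' initialization
    Public Sub Init()
        hdrArr = 0
        nxtArr = 1
    End Sub

    Public Sub InsertValue(ByVal v As Long)
        ' no error checking: assumes nxtArr has not exceeded MaxNodes
        valArr(nxtArr) = v
        lnkArr(nxtArr) = hdrArr             ' link the new node to the old head
        hdrArr = nxtArr                     ' the new node becomes the head of the list
        nxtArr = nxtArr + 1
    End Sub

    Public Function FindValue(ByVal v As Long) As Boolean
        Dim i As Long
        i = hdrArr
        Do While i <> 0                     ' follow subscripts instead of object references
            If valArr(i) = v Then
                FindValue = True
                Exit Function
            End If
            i = lnkArr(i)
        Loop
        FindValue = False
    End Function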
Comparison
Table 2 illustrates resource requirements for a hash table implemented using three
strategies. The array method represents nodes as elements of an array, the object method
represents nodes as objects, while the collection method utilizes the built-in hashing feature
of Visual Basic collections.
Table 2: Time, memory, and page faults for each node-storage strategy.

                           time (ms)
      n  method        insert     find   delete   kBytes   faults
  1,000  array              10       10       10       72       17
         object             50       20       40      104       26
         collection         60       40       40      100       25
  5,000  array              40       50       40      228      132
         object            301       90      261      744      186
         collection        490      220      220      516      129
 10,000  array              90      101       90      412      297
         object            711      200      822    1,604      401
         collection      1,111      471      491    1,044      261
 50,000  array             450      541      540    2,252    1,524
         object          7,481    1,062   13,279    8,480    2,118
         collection      9,394    2,623    2,794    5,420    1,355
100,000  array             912    1,141    1,122    4,504    3,047
         object         22,182    2,103   48,570   17,072    4,266
         collection     27,830    5,658    5,918   10,896    2,724
Memory requirements and page faults are shown for insertion only. Hash table size for
arrays and objects was 1/10th the total count. Tests were run using a 200 MHz Pentium with
64 MB of memory on the Windows NT 4.0 operating system. Statistics for memory use and
page faults were obtained from the NT Task Manager. Code may be downloaded that
implements the above tests, so you may repeat the experiment.
It is immediately apparent that the array method is fastest, and consumes the least
memory. Objects consume four times as much memory as arrays. In fact, overhead for a
single object is about 140 bytes. Collections take about twice as much room as arrays.
An interesting anomaly is the high deletion time associated with objects. When we
increase the number of nodes from 50,000 to 100,000 (a factor of 2), the time for deletion
increases from 13 to 48 seconds (nearly a factor of 4). During deletion, no page faults were
noted. Consequently, the extra overhead was compute time, not I/O time. One run-time
implementation of memory freeing maintains a list of free nodes ordered by memory location.
When memory is freed, the list is traversed so that the released block can be returned to its
proper place, allowing adjacent free chunks to be recombined. Unfortunately, this algorithm
runs in O(n^2) time: execution time is roughly proportional to the square of the number of
items being released.
I encountered a similar problem while working on compilers for Apollo. In this case,
however, the problem was exacerbated by page faults that occurred while traversing links.
The solution involved an in-core index that reduced the number of links traversed.
Conclusion
Hashing is an effective method for quickly accessing data using a key value. Fortunately,
Visual Basic includes collections, a ready-made solution that is easy to code. In this article,
we compared collections with hand-coded solutions. Along the way we discovered that
storing data in objects for large datasets can incur substantial penalties in both execution
time and storage requirements. In this case, you can make significant gains by coding your
own algorithm, utilizing arrays for node storage. For smaller datasets, however, collections
remain a good choice.
Bibliography
Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest and Clifford Stein [2001].
Introduction to Algorithms. McGraw-Hill, New York.
Knuth, Donald E. [1998]. The Art of Computer Programming, Volume 3: Sorting and
Searching. Addison-Wesley, Reading, Massachusetts.
Pearson, Peter K. [1990]. Fast Hashing of Variable-Length Text Strings. Communications
of the ACM, 33(6):677-680.
Stephens, Rod [1998]. Ready-to-Run Visual Basic Algorithms. John Wiley & Sons, New York.