0% found this document useful (0 votes)

9 views58 pages

Unit Iii

hashing and sorting

Uploaded by

lakshmi shree

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

9 views58 pages

Unit Iii

hashing and sorting

Uploaded by

lakshmi shree

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 58

UNIT –IV

HASHING AND SORTING

Hashing-Open addressing-Rehashing-Extendible hashing. Sorting-Bubble sort, insertion sort,
selection sort, shell sort, Heap sort, quick sort, Radix sort, Merge sort. Searching- Linear search,
Binary search.

4.1 Hash Tables and Direct Address Tables

 Define and describe what a hash table is
- Introduce key/value relationships
- Introduce concepts such as table size and other aspects of tables that are independent
of type and method of implementation.
- Define "n" for time-complexity.

 Iteration order for hash tables by augmenting the structure

- iterating over items in the order in which they were inserted
- iterating over the items based on most-recently-used
A hash table, or a hash map, is a data structure that associates keys with values. The
primary operation it supports efficiently is a lookup: given a key (e.g. a person's name), find the
corresponding value (e.g. that person's telephone number). It works by transforming the key
using a hash function into a hash, a number that the hash table uses to locate the desired value.

Time complexity and common uses of hash tables

Hash tables are often used to implement associative arrays, sets and caches. Like arrays, hash
tables provide constant-time O(1) lookup on average, regardless of the number of items in the
table. However, the rare worst-case lookup time can be as bad as O(n). Compared to other
associative array data structures, hash tables are most useful when large numbers of records of
data are to be stored.
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use
with persistent data structures; database indexes commonly use disk-based data structures based
on hash tables.

Hash tables are used to speed-up string searching in many implementations of data compression.

2.8 Hash Function-Open Addressing

A good hash function is essential for good hash table performance. A poor choice of hash
function is likely to lead to clustering, in which probability of keys mapping to the same hash
bucket (i.e. a collision) is significantly greater than would be expected from a random function.
Nonzero probability of collisions is inevitable in any hash implementation, but usually the
number of operations to resolve collisions scales linearly with the number of keys mapping to the
same bucket, so excess collisions will degrade performance significantly. In addition, some hash
functions are computationally expensive, so the amount of time (and, in some cases, memory)
taken to compute the hash may be burdensome.

Choosing a good hash function is tricky. The literature is replete with poor choices, at least when
measured by modern standards However, since poor hashing merely degrades hash table
performance for particular input key distributions, such problems commonly go undetected.
The literature is similarly sparse on the criteria for choosing a hash function. Unlike most other
fundamental algorithms and data structures, there is no universal consensus on what makes a
"good" hash function. The remainder of this section is organized by three criteria: simplicity,
speed, and strength, and will survey algorithms known to perform well by these criteria.

Simplicity and speed are readily measured objectively (by number of lines of code and CPU
benchmarks, for example), but strength is a more slippery concept. Obviously, a cryptographic
hash function such as SHA-1 would satisfy the relatively lax strength requirements needed for
hash tables, but their slowness and complexity makes them unappealing. In fact, even a
cryptographic hash does not provide protection against an adversary who wishes to degrade hash
table performance by choosing keys all hashing to the same bucket. For these specialized cases, a
universal hash function should be used instead of any one static hash, no matter how
sophisticated.

In the absence of a standard measure for hash function strength, the current state of the art is to
employ a battery of statistical tests to measure whether the hash function can be readily
distinguished from a random function. Arguably the most important such test is to determine
whether the hash function displays the avalanche effect, which essentially states that any single-
bit change in the input key should affect on average half the bits in the output. Bret Mulvey
advocates testing the strict avalanche condition in particular, which states that, for any single-bit
change, each of the output bits should change with probability one-half, independent of the other
bits in the key. Purely additive hash functions such as CRC fail this stronger condition miserably.

Clearly, a strong hash function should have a uniform distribution of hash values. Bret Mulvey
proposes the use of a chi-squared test for uniformity, based on power of two hash table sizes
ranging from 21 to 216. This test is considerably more sensitive than many others proposed for
measuring hash functions, and finds problems in many popular hash functions.

Fortunately, there are good hash functions that satisfy all these criteria. The simplest class all
consume one byte of the input key per iteration of the inner loop. Within this class, simplicity
and speed are closely related, as fast algorithms simply don't have time to perform complex
calculations. Of these, one that performs particularly well is the Jenkins One-at-a-time hash,
adapted here from an article by Bob Jenkins, its creator.

uint32 joaat_hash(uchar *key, size_t len)

{
uint32 hash = 0;
size_t i;

for (i = 0; i < len; i++)

{
hash += key[i];
hash += (hash << 10);
hash ^= (hash >> 6);
}
hash += (hash << 3);
hash ^= (hash >> 11);
hash += (hash << 15);
return hash;
}

Security Purpose
Avalanche behavior of Jenkins One-at-a-time hash over 3-byte keys
The avalanche behavior of this hash shown on the right. The image was made using Bret
Mulvey's AvalancheTest in his Hash.cs toolset. Each row corresponds to a single bit in the input,
and each column to a bit in the output. A green square indicates good mixing behavior, a yellow
square weak mixing behavior, and red would indicate no mixing. Only a few bits in the last byte
are weakly mixed, a performance vastly better than a number of widely used hash functions.

Many commonly used hash functions perform poorly when subjected to such rigorous avalanche
testing. The widely favored FNV hash, for example, shows many bits with no mixing at all,
especially for short keys. If speed is more important than simplicity, then the class of hash
functions which consume multibyte chunks per iteration may be of interest. One of the most
sophisticated is "lookup3" by Bob Jenkins, which consumes input in 12 byte (96 bit) chunks.
Note, though, that any speed improvement from the use of this hash is only likely to be useful for
large keys, and that the increased complexity may also have speed consequences such as
preventing an optimizing compiler from inlining the hash function. Bret Mulvey analyzed an
earlier version, lookup2, and found it to have excellent avalanche behavior.

One desirable property of a hash function is that conversion from the hash value (typically 32
bits) to an bucket index for a particular-size hash table can be done simply by masking,
preserving only the lower k bits for a table of size 2k (an operation equivalent to computing the
hash value modulo the table size). This property enables the technique of incremental doubling
of the size of the hash table - each bucket in the old table maps to only two in the new table.
Because of its use of XOR-folding, the FNV hash does not have this property. Some older hashes
are even worse, requiring table sizes to be a prime number rather than a power of two, again
computing the bucket index as the hash value modulo the table size. In general, such a
requirement is a sign of a fundamentally weak function; using a prime table size is a poor
substitute for using a stronger function.

Collision resolution
If two keys hash to the same index, the corresponding records cannot be stored in the same
location. So, if it's already occupied, we must find another location to store the new record, and
do it so that it can find out and when look it up later on.

To give an idea of the importance of a good collision resolution strategy, consider the following
result, derived using the birthday paradox. Even if it assumes that our hash function outputs
random indices uniformly distributed over the array, and even for an array with 1 million entries,
there is a 95% chance of at least one collision occurring before it contains 2500 records.

There are a number of collision resolution techniques, but the most popular are chaining and
open addressing.

Chaining
Hash collision resolved by chaining.
In the simplest chained hash table technique, each slot in the array references a linked list of
inserted records that collide to the same slot. Insertion requires finding the correct slot, and
appending to either end of the list in that slot; deletion requires searching the list and removal.

Chained hash tables have advantages over open addressed hash tables in that the removal
operation is simple and resizing the table can be postponed for a much longer time because
performance degrades more gracefully even when every slot is used. Indeed, many chaining hash
tables may not require resizing at all since performance degradation is linear as the table fills.
For example, a chaining hash table containing twice its recommended capacity of data would
only be about twice as slow on average as the same table at its recommended capacity.

Chained hash tables inherit the disadvantages of linked lists. When storing small records, the
overhead of the linked list can be significant. An additional disadvantage is that traversing a
linked list has poor cache performance.

Alternative data structures can be used for chains instead of linked lists. By using a self-
balancing tree, for example, the theoretical worst-case time of a hash table can be brought down
to O(log n) rather than O(n). However, since each list is intended to be short, this approach is
usually inefficient unless the hash table is designed to run at full capacity or there are unusually
high collision rates, as might occur in input designed to cause collisions. Dynamic arrays can
also be used to decrease space overhead and improve cache performance when records are small.

Some chaining implementations use an optimization where the first record of each chain is stored
in the table. Although this can increase performance, it is generally not recommended: chaining
tables with reasonable load factors contain a large proportion of empty slots, and the larger slot
size causes them to waste large amounts of space.

Open addressing

Hash collision resolved by linear probing (interval=1).

Open addressing hash tables can store the records directly within the array. A hash collision is
resolved by probing, or searching through alternate locations in the array (the probe sequence)
until either the target record is found, or an unused array slot is found, which indicates that there
is no such key in the table. Well known probe sequences include:

1.Linear probing in which the interval between probes is fixed—often at 1,

2.Quadratic probing in which the interval between probes increases linearly (hence, the indices
are described by a quadratic function), and
3.Double hashing in which the interval between probes is fixed for each record but is computed
by another hash function.
The main tradeoffs between these methods are that linear probing has the best cache performance
but is most sensitive to clustering, while double hashing has poor cache performance but exhibits
virtually no clustering; quadratic hashing falls in-between in both areas. Double hashing can also
require more computation than other forms of probing. Some open addressing methods, such as
last-come-first-served hashing and cuckoo hashing move existing keys around in the array to
make room for the new key. This gives better maximum search times than the methods based on
probing.

A critical influence on performance of an open addressing hash table is the load factor; that is,
the proportion of the slots in the array that are used. As the load factor increases towards 100%,
the number of probes that may be required to find or insert a given key rises dramatically. Once
the table becomes full, probing algorithms may even fail to terminate. Even with good hash
functions, load factors are normally limited to 80%. A poor hash function can exhibit poor
performance even at very low load factors by generating significant clustering. What causes hash
functions to cluster is not well understood, and it is easy to unintentionally write a hash function
which causes severe clustering.

Algorithm format
The following pseudocode is an implementation of an open addressing hash table with linear
probing and single-slot stepping, a common approach that is effective if the hash function is
good. Each of the lookup, set and remove functions use a common internal function findSlot to
locate the array slot that either does or should contain a given key.

record pair { key, value }

var pair array slot[0..numSlots-1]

function findSlot(key)
i := hash(key) modulus numSlots
loop
if slot[i] is not occupied or slot[i].key = key
return i
i := (i + 1) modulus numSlots

function lookup(key)
i := findSlot(key)
if slot[i] is occupied // key is in table
return slot[i].value
else // key is not in table
return not found

function set(key, value)

i := findSlot(key)
if slot[i] is occupied
slot[i].value := value
else
if the table is almost full
rebuild the table larger (note 1)
i := findSlot(key)
slot[i].key := key
slot[i].value := value

Another example showing open addressing technique. Presented function is converting each
part(4) of an Internet protocol address, where NOT is bitwise NOT, XOR is bitwise XOR, OR is
bitwise OR, AND is bitwise AND and << and >> are shift-left and shift-right:

// key_1,key_2,key_3,key_4 are following 3-digit numbers - parts of ip address xxx.xxx.xxx.xxx

function ip(key parts)
j := 1
do
key := (key_2 << 2)
key := (key + (key_3 << 7))
key := key + (j OR key_4 >> 2) * (key_4) * (j + key_1) XOR j
key := key AND _prime_ // _prime_ is a prime number
j := (j+1)
while collision
return key
note 1
Rebuilding the table requires allocating a larger array and recursively using the set operation to
insert all the elements of the old array into the new larger array. It is common to increase the
array size exponentially, for example by doubling the old array size.
function remove(key)
i := findSlot(key)
if slot[i] is unoccupied
return // key is not in the table
j := i
loop
j := (j+1) modulus numSlots
if slot[j] is unoccupied
exit loop
k := hash(slot[j].key) modulus numSlots
if (j > i and (k <= i or k > j)) or
(j < i and (k <= i and k > j)) (note 2)
slot[i] := slot[j]
i := j
mark slot[i] as unoccupied
note 2
For all records in a cluster, there must be no vacant slots between their natural hash position and
their current position (else lookups will terminate before finding the record). At this point in the
pseudocode, i is a vacant slot that might be invalidating this property for subsequent records in
the cluster. j is such as subsequent record. k is the raw hash where the record at j would naturally
land in the hash table if there were no collisions. This test is asking if the record at j is invalidly
positioned with respect to the required properties of a cluster now that i is vacant.
Another technique for removal is simply to mark the slot as deleted. However this eventually
requires rebuilding the table simply to remove deleted records. The methods above provide O(1)
updating and removal of existing records, with occasional rebuilding if the high water mark of
the table size grows.

The O(1) remove method above is only possible in linearly probed hash tables with single-slot
stepping. In the case where many records are to be deleted in one operation, marking the slots for
deletion and later rebuilding may be more efficient.

Open addressing versus chaining

Chained hash tables have the following benefits over open addressing:
They are simple to implement effectively and only require basic data structures.From the point of
view of writing suitable hash functions, chained hash tables are insensitive to clustering, only
requiring minimization of collisions. Open addressing depends upon better hash functions to
avoid clustering. This is particularly important if novice programmers can add their own hash
functions, but even experienced programmers can be caught out by unexpected clustering effects.
They degrade in performance more gracefully. Although chains grow longer as the table fills, a
chained hash table cannot "fill up" and does not exhibit the sudden increases in lookup times that
occur in a near-full table with open addressing. If the hash table stores large records, about 5 or
more words per record, chaining uses less memory than open addressing.If the hash table is
sparse (that is, it has a big array with many free array slots), chaining uses less memory than
open addressing even for small records of 2 to 4 words per record due to its external storage.

This graph compares the average number of cache misses required to lookup elements in tables
with chaining and linear probing. As the table passes the 80%-full mark, linear probing's
performance drastically degrades.For small record sizes (a few words or less) the benefits of in-
place open addressing compared to chaining are:

They can be more space-efficient than chaining since they don't need to store any pointers or
allocate any additional space outside the hash table. Simple linked lists require a word of
overhead per element.Insertions avoid the time overhead of memory allocation, and can even be
implemented in the absence of a memory allocator.Because it uses internal storage, open
addressing avoids the extra indirection required for chaining's external storage. It also has better
locality of reference, particularly with linear probing. With small record sizes, these factors can
yield better performance than chaining, particularly for lookups.They can be easier to serialize,
because they don't use pointers.On the other hand, normal open addressing is a poor choice for
large elements, since these elements fill entire cache lines (negating the cache advantage), and a
large amount of space is wasted on large empty table slots. If the open addressing table only
stores references to elements (external storage), it uses space comparable to chaining even for
large records but loses its speed advantage.

Normally open addressing is better used for hash tables with small records that can be stored
within the table (internal storage) and fit in a cache line. They are particularly suitable for
elements of one word or less. In cases where the tables are expected to have high load factors,
the records are large, or the data is variable-sized, chained hash tables often perform as well or
better.

Ultimately, used sensibly any kind of hash table algorithm is usually fast enough; and the
percentage of a calculation spent in hash table code is low. Memory usage is rarely considered
excessive. Therefore, in most cases the differences between these algorithms is marginal, and
other considerations typically come into play.

Coalesced hashing
A hybrid of chaining and open addressing, coalesced hashing links together chains of nodes
within the table itself. Like open addressing, it achieves space usage and (somewhat diminished)
cache advantages over chaining. Like chaining, it does not exhibit clustering effects; in fact, the
table can be efficiently filled to a high density. Unlike chaining, it cannot have more elements
than table slots.

Perfect hashing
If all of the keys that will be used are known ahead of time, and there are no more keys that can
fit the hash table, perfect hashing can be used to create a perfect hash table, in which there will
be no collisions. If minimal perfect hashing is used, every location in the hash table can be used
as well.

Perfect hashing gives a hash table where the time to make a lookup is constant in the worst case.
This is in contrast to chaining and open addressing methods, where the time for lookup is low on
average, but may be arbitrarily large. There exist methods for maintaining a perfect hash function
under insertions of keys, known as dynamic perfect hashing. A simpler alternative, that also
gives worst case constant lookup time, is cuckoo hashing.

Probabilistic hashing
Perhaps the simplest solution to a collision is to replace the value that is already in the slot with
the new value, or slightly less commonly, drop the record that is to be inserted. In later searches,
this may result in a search not finding a record which has been inserted. This technique is
particularly useful for implementing caching.

An even more space-efficient solution which is similar to this is use a bit array (an array of one-
bit fields) for table. Initially all bits are set to zero, and when it insert a key,it is to be set the
corresponding bit to one. False negatives cannot occur, but false positives can, since if the search
finds a 1 bit, it will claim that the value was found, even if it was just another value that hashed
into the same array slot by coincidence. In reality, such a hash table is merely a specific type of
Bloom filter.

Table resizing
With a good hash function, a hash table can typically contain about 70%–80% as many elements
as it does table slots and still perform well. Depending on the collision resolution mechanism,
performance can begin to suffer either gradually or dramatically as more elements are added. To
deal with this, when the load factor exceeds some threshold, we allocate a new, larger table, and
add all the contents of the original table to this new table.

This can be a very expensive operation, and the necessity for it is one of the hash table's
disadvantages. In fact, some naive methods for doing this, such as enlarging the table by one
each time you add a new element, reduce performance so drastically as to make the hash table
useless. However, if we enlarge the table by some fixed percent, such as 10% or 100%, it can be
shown using amortized analysis that these resizings are so infrequent that the average time per
lookup remains constant-time. To see why this is true, suppose a hash table using chaining
begins at the minimum size of 1 and is doubled each time it fills above 100%. If in the end it
contains n elements, then the total add operations performed for all the resizings is:

1 + 2 + 4 + ... + n = 2n - 1.
Because the costs of the resizings form a geometric series, the total cost is O(n). But we also
perform n operations to add the n elements in the first place, so the total time to add n elements
with resizing is O(n), an amortized time of O(1) per element.

On the other hand, some hash table implementations, notably in real-time systems, cannot pay
the price of enlarging the hash table all at once, because it may interrupt time-critical operations.
One simple approach is to initially allocate the table with enough space for the expected number
of elements and forbid the addition of too many elements. Another useful but more memory-
intensive technique is to perform the resizing gradually:

Allocate the new hash table, but leave the old hash table and check both tables during lookups.
Each time an insertion is performed, add that element to the new table and also move k elements
from the old table to the new table.When all elements are removed from the old table, deallocate
it.To ensure that the old table will be completely copied over before the new table itself needs to
be enlarged, it's necessary to increase the size of the table by a factor of at least (k + 1)/k during
the resizing.

Linear hashing is a hash table algorithm that permits incremental hash table expansion. It is
implemented using a single hash table, but with two possible look-up functions.

Another way to decrease the cost of table resizing is to choose a hash function in such a way that
the hashes of most values do not change when the table is resized. This approach, called
consistent hashing, is prevalent in disk-based and distributed hashes, where resizing is
prohibitively costly.

Ordered retrieval issue

Hash tables store data in pseudo-random locations, so accessing the data in a sorted manner is a
very time consuming operation. Other data structures such as self-balancing binary search trees
generally operate more slowly (since their lookup time is O(log n)) and are rather more complex
to implement than hash tables but maintain a sorted data structure at all times. See a comparison
of hash tables and self-balancing binary search trees.

Problems with hash tables

Although hash table lookups use constant time on average, the time spent can be significant.
Evaluating a good hash function can be a slow operation. In particular, if simple array indexing
can be used instead, this is usually faster.

Hash tables in general exhibit poor locality of reference—that is, the data to be accessed is
distributed seemingly at random in memory. Because hash tables cause access patterns that jump
around, this can trigger microprocessor cache misses that cause long delays. Compact data
structures such as arrays, searched with linear search, may be faster if the table is relatively small
and keys are cheap to compare, such as with simple integer keys. According to Moore's Law,
cache sizes are growing exponentially and so what is considered "small" may be increasing. The
optimal performance point varies from system to system; for example, a trial on Parrot shows
that its hash tables outperform linear search in all but the most trivial cases (one to three entries).

More significantly, hash tables are more difficult and error-prone to write and use. Hash tables
require the design of an effective hash function for each key type, which in many situations is
more difficult and time-consuming to design and debug than the mere comparison function
required for a self-balancing binary search tree. In open-addressed hash tables it's even easier to
create a poor hash function.
Additionally, in some applications, a black hat with knowledge of the hash function may be able
to supply information to a hash which creates worst-case behavior by causing excessive
collisions, resulting in very poor performance (i.e., a denial of service attack). In critical
applications, either universal hashing can be used or a data structure with better worst-case
guarantees may be preferable.

Other hash table algorithms

Extendible hashing and linear hashing are hash algorithms that are used in the context of
database algorithms used for instance in index file structures, and even primary file organization
for a database. Generally, in order to make search scalable for large databases, the search time
should be proportional log N or near constant, where N is the number of records to search. Log N
searches can be implemented with tree structures, because the degree of fan out and the shortness
of the tree relates to the number of steps needed to find a record, so the height of the tree is the
maximum number of disc accesses it takes to find where a record is. However, hash tables are
also used, because the cost of a disk access can be counted in units of disc accesses, and often
that unit is a block of data. Since a hash table can, in the best case, find a key with one or two
accesses, a hash table index is regarded as generally faster when retrieving a collection of records
during a join operation e.g.

SELECT * from customer, orders where customer.cust_id = orders.cust_id and cust_id = X

i.e. If orders has a hash index on cust_id, then it takes constant time to locate the block that
contains record locations for orders matching cust_id = X. (although, it would be better if the
value type of orders was a list of order ids, so that hash keys are just one unique cust_id for each
batch of orders, to avoid unnecessary collisions).

Extendible hashing and linear hashing have certain similarities: collisions are accepted as
inevitable and are part of the algorithm where blocks or buckets of collision space is added ;
traditional good hash function ranges are required, but the hash value is transformed by a
dynamic address function : in extendible hashing, a bit mask is used to mask out unwanted bits,
but this mask length increases by one periodically, doubling the available addressing space ; also
in extendible hashing, there is an indirection with a directory address space, the directory entries
being paired with another address (a pointer ) to the actual block containing the key-value pairs;
the entries in the directory correspond to the bit masked hash value (so that the number of entries
is equal to maximum bit mask value + 1 e.g. a bit mask of 2 bits, can address a directory of 00 01
10 11, or 3 + 1 = 4).

In linear hashing, the traditional hash value is also masked with a bit mask, but if the resultant
smaller hash value falls below a 'split' variable, the original hash value is masked with a bit mask
of one bit greater length, making the resultant hash value address recently added blocks. The
split variable ranges incrementally between 0 and the maximum current bit mask value e.g. a bit
mask of 2, or in the terminology of linear hashing, a "level" of 2, the split variable will range
between 0 to 3. When the split variable reaches 4, the level increases by 1, so in the next round
of the split variable, it will range between 0 to 7, and reset again when it reaches 8.

The split variable incrementally allows increased addressing space, as new blocks are added; the
decision to add a new block occurs whenever a key-and=value is being inserted, and overflows
the particular block the key-and-value's key hashes into. This overflow location may be
completely unrelated to the block going to be split pointed to by the split variable. However, over
time, it is expected that given a good random hash function that distributes entries fairly evenly
amongst all addressable blocks, the blocks that actually require splitting because they have
overflowed get their turn in round-robin fashion as the split value ranges between 0 - N where N
has a factor of 2 to the power of Level, level being the variable incremented whenever the split
variable hits N.

New blocks are added one at a time with both extendible hashing, and with linear hashing.

In extendible hashing, a block overflow ( a new key-value colliding with B other key-values,
where B is the size of a block) is handled by checking the size of the bit mask "locally", called
the "local depth", an attribute which must be stored with the block. The directory structure, also
has a depth, the "global depth". If the local depth is less than the global depth, then the local
depth is incremented, and all the key values are rehashed and passed through a bit mask which is
one bit longer now, placing them either in the current block, or in another block. If the other
block happens to be the same block when looked up in the directory, a new block is added, and
the directory entry for the other block is made to point to the new block. Why does the directory
have entries where two entries point to the same block ? This is because if the local depth is
equal to the global depth of the directory, this means the bit mask of the directory does not have
enough bits to DEAL with an increment in the bit mask length of the block, and so the directory
must have its bit mask length incremented, but this means the directory now doubles the number
of addressable entries. Since half the entries addressable don't exist, the directory simply copies
the pointers over to the new entries e.g. if the directory had entries for 00, 01, 10, 11, or a 2 bit
mask, and it becomes a 3 bit mask, then 000 001 010 011 100 101 110 111 become the new
entries, and 00's block address go to 000 and 001 ; 01's pointer goes to 010 and 011, 10 goes to
100 and 101 and so on. And so this creates the situation where two directory entries point to the
same block. Although the block that was going to overflow, now can add a new block by
redirecting the second pointer to a newly appended block, the other original blocks will have two
pointers to them. When it is their turn to split, the algorithm will check local vs global depth and
this time find that the local depth is less, and hence no directory splitting is required, only a new
block be appended, and the second directory pointer moved from addressing the previous block
to addressing the new block.

In linear hashing, adding a similarly hashed block does not occurs immediately when a block
overflows, and therefore an overflow block is created to be attached to the overflowing block.
However, a block overflow is a signal that more space will be required, and this happens by
splitting the block pointed to by the "split" variable, which is initially zero, and hence initially
points to block zero. The splitting is done by taking all the key-value pairs in the splitting block,
and its overflow block(s), hashing the keys again, but with a bit mask of length current level + 1.
This will result in two block addresses, some will be the old block number, and others will be

a2 = old block number + ( N times 2 ^ (level) )

Rationale
Let m = N times 2 ^ level ; if h is the original hash value, and old block number = h mod m, and
now the new block number is h mod ( m * 2 ), because m * 2 = N times 2 ^ (level+1), then the
new block number is either h mod m if (h / m) is even so dividing h/m by 2 leaves a zero
remainder and therefore doesn't change the remainder, or the new block number is ( h mod m ) +
m because h / m is an odd number, and dividing h / m by 2 will leave an excess remainder of m,
+ the original remainder. ( The same rationale applies to extendible hashing depth
incrementing ).

As above, a new block is created with a number a2, which will usually occur at +1 the previous
a2 value. Once this is done, the split variable is incremented, so that the next a2 value will be
again old a2 + 1. In this way, each block is covered by the split variable eventually, so each
block is preemptively rehashed into extra space, and new blocks are added incrementally.
Overflow blocks that are no longer needed are discarded, for later garbage collection if needed,
or put on an available free block list by chaining.

When the split variable reaches ( N times 2 ^ level ), level is incremented and split variable is
reset to zero. In this next round, the split variable will now traverse from zero to ( N times 2 ^
(old_level + 1 ) ), which is exactly the number of blocks at the start of the previous round, but
including all the blocks created by the previous round.

A simple inference on file storage mapping of linear hashing and extendible hashing[edit]
As can be seen, extendible hashing requires space to store a directory which can double in size.

Since the space of both algorithms increase by one block at a time, if blocks have a known
maximum size or fixed size, then it is straight forward to map the blocks as blocks sequentially
appended to a file.

In extendible hashing, it would be logical to store the directory as a separate file, as doubling can
be accommodated by adding to the end of the directory file. The separate block file would not
have to change, other than have blocks appended to its end.
Header information for linear hashing doesn't increase in size : basically just the values for N,
level, and split need to be recorded, so these can be incorporated as a header into a fixed block
size linear hash storage file.

However, linear hashing requires space for overflow blocks, and this might best be stored in
another file, otherwise addressing blocks in the linear hash file is not as straight forward as
multiplying the block number by the block size and adding the space for N,level, and split.

4.4 SORTING

 Sorting is nothing but systematic arrangement of the data based. The systematic arrangement
means based on some key the data should be arranged.
 Types of sorting
 Internal sorting
 External sorting
 Internal Sorting:
Sort can be done in main memory, so that the number of elements is relatively small (less
than a million).

 External Sorting
Sorts that cannot be performed in main memory and must be done on disk or tape are also
quite important. This type of sorting, known as external sorting

Topics covered:

 Bubble sort
 Insertion sort
 Selection sort
 Shell sort
 Heap sort
 Merge sort
 Quick sort
 Radix sort

4.5 Bubble sort:

 Bubble is a simple and well-known sorting algorithm.

 It is used in practice once in a blue moon and its main application is to make an introduction to
the sorting algorithm.

Algorithm:

1. Compare each pair of adjacent elements from the beginning of an array, if they are in
reversed order, swap them.
2. It at least one swap has been done, repeat step 1
Example. Sort {5, 1, 12, -5, 16} using bubble sort.

Bubble Sort program:

/* Program to Sort an Array using Bubble Sort */

#include <stdio.h>
#include<conio.h>
void bubble_sort();
int a[10], n;
void main()
{
int i;
printf("\n Enter size of an array: ");
scanf("%d", &n);
printf("\n Enter elements of an array:\n");
for(i=0; i<n; i++)
scanf("%d", &a[i]);
bubble_sort();
printf("\n\nAfter sorting:\n");
for(i=0; i<n; i++)
printf("\n%d", a[i]);
getch();
}

void bubble_sort()
{
int i, j, temp;
for(i=0; i<n; i++)
{
printf( “ Pass -> %d”,i+1);
for(j=0; j<(n-1)-j; j++)
{
if(a[j] > a[j+1])
{
temp = a[j];
a[j] = a[j+1];
a[j+1] = temp;
}
}
}
}

Implementation
 The efficiency of a sorting algorithm is measured in terms of number of comparisons.
 In bubble sort, there are n–1 comparisons in Pass 1, n–2 comparisons in Pass 2, and so on.
 Total number of comparisons = (n–1) + (n –2) + (n –3) +… +3+2+1= n(n – 1)/2.
 n(n – 1)/2 is of O(n2) order. Therefore, the bubble sort algorithm is of the order O(n2).
 Average and worst case complexity of bubble sort is O(n 2). Also, it makes O(n2) swaps in the
worst case.
 Avoid implementations, which don't check if the array is already sorted on every step (any swaps
made). This check is necessary, in order to preserve adaptive property.
 One more problem of bubble sort is that its running time badly depends on the initial order of the
elements.
 Big elements (rabbits) go up fast, while small ones (turtles) go down very slow.

4.6 Selection sort:

The idea of algorithm is quite simple. Array is imaginary divided into two parts - sorted
one and unsorted one. At the beginning, sorted part is empty, while unsorted one contains whole
array. At every step, algorithm finds smallest element in the unsorted part and adds it to the end of
the sorted one. When unsorted partbecomes empty, algorithm stops.

When algorithm sorts an array, it swaps first element of unsorted part with minimal
element and then it is included to the sorted part. This implementation of selection sort in not stable.
In case of linked list is sorted, and, instead of swaps, minimal element is linked to the unsorted part,
selection sort is stable.

The algorithm work as follows:

1. Set the first position as current position

2. Find the minimum value in the list
3. Swap it with the value in the current position
4. Set next position as current position
5. Repeat steps 2- 4 until you reach end of list.

Let us see an example of sorting an array to make the idea of selection sort clearer.
Example. Sort {5, 1, 12, -5, 16, 2, 12, 14} using selection sort.
Efficiency of selection Sort:

 Selection sort stops, when unsorted part becomes empty. As we know, on every step
number of unsorted elements decreased by one.
 There are n–1 comparisons during Pass 1 to find the smallest element, n–2 comparisons
during Pass 2 to find the second smallest element, and so on.
 Total number of comparisons = (n – 1) + (n – 2) + (n – 3) + … + 3 + 2 + 1 = n(n – 1)/2
 n(n – 1)/2 is of O(n2) order. Therefore, the selection sort algorithm is of the order O(n2).
 But the Fact, that selection sort requires n - 1 number of swaps at most, makes it very
efficient in situations, when write operation is significantly more expensive, than read
operation.

Implementation Selection Sort :

/***** Program to Sort an Array using Selection Sort *****/

#include <stdio.h>
#include <conio.h>
void selection_sort();
int a[10], n;
void main()
{
int i;
printf("\nEnter size of an array: ");
scanf("%d", &n);
printf("\nEnter elements of an array:\n");
for(i=0; i<n; i++)
scanf("%d", &a[i]);
selection_sort();
printf("\n\nAfter sorting:\n");
for(i=0; i<n; i++)
printf("\n%d", a[i]);
getch();
}

void selection_sort()
{
int i, j, min, temp;
for (i=0; i<n; i++)
{
min = i;
for (j=i+1; j<n; j++)
{
if (a[j] < a[min])
min = j;
}
temp = a[i];
a[i] = a[min];
a[min] = temp;
}
}

4.7 Insertion Sort:

 Insertion sort algorithm somewhat resembles selection sort.

 Here, sorting takes place by inserting a particular element at the appropriate position that’s why
the name – insertion sorting
 Array is imaginary divided into two parts - sorted one and unsorted one.
 At the beginning, sorted part contains first element of the array and unsorted one contains the
rest.
 At every step, algorithm takes first element in the unsorted part and inserts it to the right place
of the sorted one.
 When unsorted part becomes empty, algorithm stops. Sketchy, insertion sort algorithm step
looks like this:
 This procedure is repeated for all the elements in the list.

becomes

Let us see an example of insertion sort routine to make the idea of algorithm clearer.

Example. Sort {7, -5, 2, 16, 4} using insertion sort.

The ideas of insertion

The main operation of the algorithm is insertion. The task is to insert a value into the
sorted part of the array. Let us see the variants of how we can do it.

Shifting instead of swapping

It will write sifted element only to the final correct position. Let us see an illustration.

It is the most commonly used modification of the insertion sort.

Using binary search

It is reasonable to use binary search algorithm to find a proper place for insertion. This
variant of the insertion sort is called binary insertion sort. After position for insertion is found,
algorithm shifts the part of the array and inserts the element. This version has lower number of
comparisons, but overall average complexity remains O(n2). From a practical point of view this
improvement is not very important, because insertion sort is used on quite small data sets.

Algorithm:

Insert (Arr, N) where ARR is an array of N elements.

Step1: Repeat for p = 1,2,3,….N
Step2: Assign temp = arr[p]
Step3: Repeat for j=p to 0 and Arr[j-1]>temp
arr[j]=arr[j-1]
else goto step 6
Step4: Decrement j by 1
Step5: [End of step3 for loop]
Step6: Assign Arr[j]=temp
Step7: [End of step1 for loop]
Step8: print the sorted array ARR
Insertion Sort Program:

/ Program to Sort an Array using Insertion Sort /

#include <stdio.h>
#include<conio.h>
void insertion_sort();

int a[10],n;
void main()
{
int i;
printf("\nEnter size of an array: ");
scanf("%d", &n);
printf("\nEnter elements of an array:\n");
for(i=0; i<n; i++)
scanf("%d", &a[i]);
insertion_sort();
printf("\n\nAfter sorting:\n");
for(i=0; i<n; i++)
printf("\n%d", a[i]);
getch();
}

void insertion_sort()
{
int i, j, temp;
for(i=1; i<n; i++)
{
temp = a[i];
j = i-1;
while (j>=0 && a[j]>temp)
{
a[j+1] = a[j];
j--;
}
a[j+1] = temp;
}
}

4.8 Shell sort

 Shell sort is an extension of insertion sort

 Limitation of insertion sort
It computes only the consecutive elements and interchanges the elements by only one
space. The smaller elements that are fast array require many passes through the sort to property
insert them in its correct position.

 Shell sort overcomes this limitation by computing elements that are at a specific distance from
each other and interchanges them if necessary.
 Shell sort divides the list into smallest sub lists and then sorts sub lists separately using insertion
sort.
 This method splits the input list into h-in depending sorted list.
 The value of h will be initially high and is repeatedly decremented until it reaches 1.
 When h is equal to 1. A regular insertion self is performed on the list, but by then the list is
guaranteed to be almost sorted.
 In the next pass, it takes all the elements and sorts the entire list.
 Shell sort is also called as diminishing insertion sort, because the number of elements
compared in a group continuously decreases.

o Select the distance by which the elements in a group will be separated to form multiple
sublists.
o Apply insertion sort on each sublist to move the elements towards their correct positions.

Apply insertion sort to sort the three lists

Increment = 2 Pass = 2

Apply insertion sort on each sublist

0 1 2 3 4 5 6 7 8 9 10

10 30 20 40 45 60 70 80 75 90 110
arr

Apply insertion sort to sort the list

0 1 2 3 4 5 6 7 8 9 10

10 20 30 40 45 60 70 75 80 90 110
arr

The list is now sorted

 Shell sort improves insertion sort by comparing the elements separated by a distance of
several positions. This helps an element to take a bigger step towards its correct position,
thereby reducing the number of comparisons.

Shell sort routine:

#include <stdio.h>
#include<conio.h>
void shell_sort();
int a[10],n;
void main()
{
int i;
printf("\nEnter size of an array: ");
scanf("%d", &n);
printf("\nEnter elements of an array:\n");
for(i=0; i<n; i++)
scanf("%d", &a[i]);
shell_sort();
printf("\n\nAfter sorting:\n");
for(i=0; i<n; i++)
printf("\n%d", a[i]);
getch();
}
void shell (int a[], int n)
{
for(i=(n+1)/2;i≥1;i/=2)
{
for{j=1;j<=n-1;j++)
{
temp=a[j];
k=j-i;
while(k≥0&&temp<a[k])
{
a[k+i]=a[k];
k=k- i;
}
a[k+i]=temp;
}
}
}
4.9 Radix sort:

 Radix sort is a clever and intuitive little sorting algorithm.

 Radix sort puts the elements in order by comparing the Digits of the numbers.
Ex: Consider the following 9 numbers
493, 812, 715, 710, 195, 437, 582, 340, 385
 We should start sorting by comparing and ordering the One’s digits:
Digit Number (submit)

0 – 340, 710
1-
2 – 812, 582
3 - 493
4
5 – 715, 195, 385
6
7 - 437
8
9
Now, we gather the all numbers (sub list) in order from the 0 sub list to 0 sub list) into the
main list again. 340,710,812,582,493,715,195,385,437

 Now, the sub lists are creating again this time based on Ten’s digit:

Digit Number (submit)

0-
1 – 710, 812, 715
2
3 - 437
4 - 340
5-
6-
7
8 – 582, 385
9 – 493, 195

 Now, the sub lists (number) are gathered in order from 0 to 9

710, 812, 715,437, 340, 582, 385, 493, 195
 Finally, the sub lists (number) are created according to the Hundred’s digit:
Digit Number (submit)

0-
1 – 195
2
3 – 340, 385
4 – 437, 493
5 - 582
6-
7 - 710, 715
8 – 812
9–
 At last, the list is gathered up again:
195, 340, 385, 437, 493, 582, 710, 715, 812
 And now we have a fully sorted array Radix sort is very simple, and a computer can do it fast.
When it is programmed properly, Radix sort is in fact one of the fastest sorting algorithms for
numbers or string of letters.
 It is array A contains 1, 13, 24, 26 and B contains 2, 15, 27, 38 then the algorithm proceeds as
follows: First a comparison is done between 1 and 2. 1 is added to C and then 13 and 2 are
compared.
 Next 2 is added to C, and then 13 and 15 are compared.
 13 is added to C, and then 24 and 15 are compared. This proceeds until 26 and 27 are compared.
4.10 Heap Sort:

 Consider the input sequence 31,41,59,26,53,58,97

 Heap sort can be implemented using two types of techniques.
 Maxheap:
In maxheap, the root node will be greater than their children nodes.
 Min heap:
In minheap, the root node will be smaller than their children nodes.
Maxheap:
In heap Sort, the root node is compared with the last right most leaf node and replace with root
node and write the root node as output.
Heap Sort
#include<stdio.h>
#include<conio.h>
int hsort[25],n,i;
void adjust(int,int);
void heapify();
void main()
{
int temp;
clrscr();
printf("\n\t\t\t\tHEAP SORT");
printf("\n\t\t\t\t**** ****\n\n\n");
printf("\nenter no of elements:");
scanf("%d",&n);
printf("\nenter elements to be sorted\n\n");
for(i=1;i<=n;i++)
scanf("%d",&hsort[i]);
heapify();
for(i=n;i>=2;i--)
{
temp=hsort[1];
hsort[1]=hsort[i];
hsort[i]=temp;
adjust(1,i-1);
}
printf("\nSORTED ELEMENT\n\n");
for(i=1;i<=n;i++)
printf("%d\n",hsort[i]);
getch();
}
void heapify()
{
int i;
for(i=n/2;i>=1;i--)
adjust(i,n);
}
void adjust(int i,int n)
{
int j,element;
j=2*i;
element=hsort[i];
while(j<=n)
{
if((j<n)&&(hsort[j]<hsort[j+1]))
j=j++;
if(element>=hsort[j])
break;
hsort[j/2]=hsort[j];
j=2*j;
}
hsort[j/2]=element;
4.11 Quicksort

Quicksort is a fast sorting algorithm, which is used not only for educational purposes, but
widely applied in practice. On the average, it has O(n log n) complexity, making quicksort suitable
for sorting big data volumes. The idea of the algorithm is quite simple and once you realize it, you
can write quicksort as fast as bubble sort.

Algorithm

The divide-and-conquer strategy is used in quicksort. Below the recursion step is described:

1. Choose a pivot value. We take the value of the middle element as pivot value, but it can
be any value, which is in range of sorted values, even if it doesn't present in the array.
2. Partition. Rearrange elements in such a way, that all elements which are lesser than the
pivot go to the left part of the array and all elements greater than the pivot, go to the right
part of the array. Values equal to the pivot can stay in any part of the array. Notice, that
array may be divided in non-equal parts.
3. Sort both parts. Apply quicksort algorithm recursively to the left and the right parts.

Partition algorithm in detail

There are two indices i and j and at the very beginning of the partition algorithm. i points
to the first element in the array and j points to the last one. Then algorithm moves i forward, until
an element with value greater or equal to the pivot is found. Index j is moved backward, until an
element with value lesser or equal to the pivot is found. If i ≤ j then they are swapped and i steps
to the next position (i + 1), j steps to the previous one (j - 1). Algorithm stops, when i becomes
greater than j. After partition, all values before ith element are less or equal than the pivot and all

values after jth element are greater or equal to the pivot.

Example. Sort {1, 12, 5, 26, 7, 14, 3, 7, 2} using quicksort.
Notice, that we show here only the first recursion step, in order not to make example too
long. But, in fact, {1, 2, 5, 7, 3} and {14, 7, 26, 12} are sorted then recursively.

Why does it work?

On the partition step algorithm divides the array into two parts and every element a from
the left part is less or equal than every element b from the right part. Also a and b satisfy a ≤
pivot ≤ b inequality. After completion of the recursion calls both of the parts become sorted and,
taking into account arguments stated above, the whole array is sorted.

4.12 Merge Sort

Merge sort algorithm:

1. It is based on the divide and conquer approach

2. Divides the list into two subsists of sizes as nearly equal as possible
3. Sorts the two subsists separately by using merge sort
4. Merges the sorted sublists into one single list
 To understand the implementation of merge sort algorithm, consider an unsorted list of numbers
stored in an array.

 Let us sort this unsorted list.

 The first step to sort data by using merge sort is to split the list into two parts.
0 1 2 3 4 5 6

53 10 30 76 3 57 24
arr

 The list has odd number of elements, therefore, the left sublist is longer than the right sublist by
one entry.
 There is a single element left in each sublist.
 Sublists with one element require no sorting.

 Start merging the sublists to obtain a sorted list.

 Further merge the sublists.

 Again, merge the sublists.

 To sort the list by using merge sort algorithm, you need to recursively divide the list into two
nearly equal sublists until each sublist contains only one element.
 To divide the list into sublists of size one requires log n passes.
 In each pass, a maximum of n comparisons are performed.
 Therefore, the total number of comparisons will be a maximum of n × log n.
 The efficiency of merge sort is equal to O(n log n)
 There is no distinction between best, average, and worst case efficiencies of merge sort because

all of them require the same amount of time.

MERGE SORT Program

/ Program to Sort an Array using Merge Sort /

#include <stdio.h>
void merge_sort(int [], int, int);
void merge_array(int [], int, int, int);
main()
{
int a[50], n, i;
printf("\nEnter size of an array: ");
scanf("%d", &n);
printf("\nEnter elements of an array:\n");
for(i=0; i<n; i++)
scanf("%d", &a[i]);
merge_sort(a, 0, n-1);
printf("\n\nAfter sorting:\n");
for(i=0; i<n; i++)
printf("\n%d", a[i]);
getch();
}
void merge_sort(int a[], int beg, int end)
{
int mid;
if (beg < end)
{
mid = (beg+end)/2;
merge_sort(a, beg, mid);
merge_sort(a, mid+1, end);
merge_array(a, beg, mid, end);
}
}

void merge_array(int a[], int beg, int mid, int end)

{
int i, left_end, num, temp, j, k, b[50];
for(i=beg; i<=end; i++)
b[i] = a[i];
i = beg;
j = mid+1;
k = beg;
while ((i<=mid) && (j<=end))
{
if (b[i] <= b[j])
{
a[k] = b[i];
i++;
k++;
}
else
{
a[k] = b[j];
j++;
k++;
}
}
if (i <= mid)
{
while (i <= mid)
{
a[k] = b[i];
i++;
k++;
}
}
else
{
while (j <= end)
{
a[k] = b[j];
j++;
k++;
}
}
}

4.13 Searching- Linear search, Binary search

One of the more important tasks performed by computers is the location and retrieval of data.

For data held in arrays there are a number of possibilities, and one of these is a simple technique
referred to as linear search.
A second approach, known as binary search, will be discussed in the next section.

These sections assume that there is no duplication of data within the data set, but the techniques
can be extended to cover data sets that do contain duplicates.

The item being searched for will be referred to as the target.

Linear Search

To perform a linear search of data held in an array, the search starts at one end (usually
the low numbered element of the array) and examines each element in the array until one of two
conditions is met., either Condition 1: the target has been found or Condition 2: the end of the
data has been reached (the target value is not in the data set).

Note that the algorithm requires that both tests are performed and that the search
terminates when one of the conditions becomes true. The second test is required to prevent the
algorithm from attempting to search past the end of the data. For illustration, consider the
following data set. Again, element 0 is leftmost:

To search for the value 7 in the array, we start by examining the first element of the array.
This does not match the target, so we increment the index counter, and try again. We now
examine the next element of the array, which has the value of 23. This does not match the target,
so we again increment the counter. This now means that we are examining the element
containing 7. This is what we are looking for, so the search is terminated, and the result of the
search is reported back to the calling function. It is usual to return the index of the element
containing the target, but there may be circumstances where a different return value may be
needed.
In Linear Search the list is searched sequentially and the position is returned if the key
element to be searched is available in the list, otherwise -1 is returned. The search in Linear
Search starts at the beginning of an array and move to the end, testing for a match at each item.

All the elements preceding the search element are traversed before the search element is
traversed. i.e. if the element to be searched is in position 10, all elements form 1-9 are checked
before 10.

Algorithm : Linear search implementation

bool linear_search ( int *list, int size, int key, int* rec )
{
// Basic Linear search
bool found = false;
int i;
for ( i = 0; i < size; i++ )
{
if ( key == list[i] )
break;
}
if ( i < size )
{
found = true;
rec = &list[i];
}
return found;
}
The code searches for the element through a loop starting from 0 to n. The loop can
terminate in one of two ways. If the index variable i reach the end of the list, the loop condition
fails. If the current item in the list matches the key, the loop is terminated early with a break
statement. Then the algorithm tests the index variable to see if it is less than that size (thus the
loop was terminated early and the item was found), or not (and the item was not found).

Ex.
Assume the element 45 is searched from a sequence of sorted elements 12, 18, 25, 36, 45,
48, 50. The Linear search starts from the first element 12, since the value to be searched is not 12
(value 45), the next element 18 is compared and is also not 45, by this way all the elements
before 45 are compared and when the index is 5, the element 45 is compared with the search
value and is equal, hence the element is found and the element position is 5.

In a linear search the search is done over the entire list even if the element to be searched
is not available. Some of our improvements work to minimize the cost of traversing the whole
data set, but those improvements only cover up what is really a problem with the algorithm.

By thinking of the data in a different way, we can make speed improvements that are
much better than anything linear search can guarantee. Consider a list in sorted order. It would
work to search from the beginning until an item is found or the end is reached, but it makes more
sense to remove as much of the working data set as possible so that the item is found more
quickly.

If it started at the middle of the list it could determine which half the item is in (because
the list is sorted). This effectively divides the working range in half with a single test. This in
turn reduces the time complexity.
Algorithm:
bool Binary_Search ( int *list, int size, int key, int* rec )

bool found = false;

int low = 0, high = size - 1;

while ( high >= low )

int mid = ( low + high ) / 2;

if ( key < list[mid] )

high = mid - 1;

else

if ( key > list[mid] )

low = mid + 1;

else

found = true;

rec = &list[mid];

break;

return found;

Binary Search

Binary search is also known as binary chop, as the data set is cut into two halves for each
step of the process. It is a very much faster search method than linear search, but to be effective
the data set must be in sorted order in the array. If the data set changes rapidly and requires
regular re-sorting then this will offset the speed gain offered by binary search over linear search.
To perform binary search, three index variables are required. By tradition these are called 'top',
'middle' and 'bottom'.

Top is initialized to one end of the array, often 0, and bottom is set to indicate the other end of
the array.

Once these two variables are set, the value of middle can be computed. Middle is set to the
midway value between top and bottom.

The value indexed by middle is compared with the target value. There are initially three possible
outcomes that we have to consider:

1: The value indexed by middle matches the target. In this case the search has found the
target and the function can return a value indicating that the search has succeeded.

2. The value is higher than the middle value in which case only the values from middle to
end need to be searched

3. The value is lower than the middle value in which case the search is carried out
between zero and the middle value.

Binary Search: Illustration

To illustrate this process, consider the following scenario - the data in the array is sorted
and the target is 29. We start by setting top to 9, bottom to 0, and calculating middle to be (0 + 9)
/ 2. This round down to 4 using C integer arithmetic, so middle is set to 4.

From this we can conclude that the target value is in the lower half of the table. This
means that top must be set to middle and a new value of middle calculated. In this case the value
of middle will be (4 + 0)/2 which C will deliver as 2. The contents of array element 2 match the
target, so in this case the search is successfully concluded.

Binary Search: Practical Issues

A couple of practical issues present themselves at this point:

 If top == bottom then the search has concluded. Unless the value at middle (and middle
must be the same as top and bottom) is the target, then the search has determined that the
target is not in the data set.

 Unless some care is taken then the search may end in a loop with top equal to (bottom +
1). There are circumstances where this loop does not terminate.

Hash Function - Wikipedia
No ratings yet
Hash Function - Wikipedia
44 pages
Hash Tables: A Detailed Description
No ratings yet
Hash Tables: A Detailed Description
10 pages
Unit 1 Dsa Hashing 2022 Compressed 1
No ratings yet
Unit 1 Dsa Hashing 2022 Compressed 1
115 pages
Lecture 8 Hashing
No ratings yet
Lecture 8 Hashing
47 pages
Unit 3.4 Hashing Techniques
No ratings yet
Unit 3.4 Hashing Techniques
7 pages
UNIT - 2 Notes
No ratings yet
UNIT - 2 Notes
40 pages
Colossion in Hasing
No ratings yet
Colossion in Hasing
22 pages
Notes of Advanced Data Structures
No ratings yet
Notes of Advanced Data Structures
202 pages
Dat Astruc T Hashing Rep
No ratings yet
Dat Astruc T Hashing Rep
13 pages
Hashing
No ratings yet
Hashing
9 pages
Hashing PPT For Student
No ratings yet
Hashing PPT For Student
53 pages
Unit 1 Dsa Hashing 2024 1
No ratings yet
Unit 1 Dsa Hashing 2024 1
146 pages
Lecture 7 - Hash - Table - Direct - Adreess - Tables - Hash - Tables - Intro - Separate - Chaining
No ratings yet
Lecture 7 - Hash - Table - Direct - Adreess - Tables - Hash - Tables - Intro - Separate - Chaining
77 pages
Module 5 Hashing
No ratings yet
Module 5 Hashing
66 pages
Unit 7
No ratings yet
Unit 7
27 pages
Unit 1 Dsa Hashing
No ratings yet
Unit 1 Dsa Hashing
137 pages
Hash
No ratings yet
Hash
10 pages
Hashing
No ratings yet
Hashing
25 pages
UNIT 1 - Hashing
No ratings yet
UNIT 1 - Hashing
118 pages
UNIT 1 - Hashing
No ratings yet
UNIT 1 - Hashing
118 pages
Week 9 - Hash Functions and Collision
No ratings yet
Week 9 - Hash Functions and Collision
73 pages
Unit 3 Hashing
No ratings yet
Unit 3 Hashing
23 pages
Hashing and Skiplist - Removed
No ratings yet
Hashing and Skiplist - Removed
113 pages
DS Module-X
No ratings yet
DS Module-X
74 pages
Hashing
No ratings yet
Hashing
35 pages
Hashing
No ratings yet
Hashing
48 pages
Final Hashing
No ratings yet
Final Hashing
41 pages
CH 4
No ratings yet
CH 4
58 pages
CH 4 Hash Table
No ratings yet
CH 4 Hash Table
20 pages
Hashing
No ratings yet
Hashing
25 pages
Unit 5 Data Structure
No ratings yet
Unit 5 Data Structure
12 pages
Unit 5 Session 5 Hashing
No ratings yet
Unit 5 Session 5 Hashing
20 pages
Hashing
No ratings yet
Hashing
30 pages
Lab 3
No ratings yet
Lab 3
5 pages
Dsa 4
No ratings yet
Dsa 4
55 pages
Handout 9 - Hashing
No ratings yet
Handout 9 - Hashing
11 pages
Values, Hash Codes, Hash Sums, Checksums or Simply Hashes.: From Wikipedia, The Free Encyclopedia
100% (1)
Values, Hash Codes, Hash Sums, Checksums or Simply Hashes.: From Wikipedia, The Free Encyclopedia
11 pages
Hashng Notes SVIMS
No ratings yet
Hashng Notes SVIMS
14 pages
MODULE 5 - BCS304 - HASHING - Leftisht Trees - OBST - Notes
No ratings yet
MODULE 5 - BCS304 - HASHING - Leftisht Trees - OBST - Notes
32 pages
DSA G5 Hashing Handouts
No ratings yet
DSA G5 Hashing Handouts
7 pages
Lecture 27 - Hashing
No ratings yet
Lecture 27 - Hashing
48 pages
HASHING
No ratings yet
HASHING
8 pages
As 3
No ratings yet
As 3
4 pages
Module 5
No ratings yet
Module 5
33 pages
Cse373 10 Hashing
No ratings yet
Cse373 10 Hashing
36 pages
Hashing Slide
No ratings yet
Hashing Slide
16 pages
Unit 6.2 Indexing and Hashing
No ratings yet
Unit 6.2 Indexing and Hashing
37 pages
Idst 2016 SA 05 Hashing
No ratings yet
Idst 2016 SA 05 Hashing
68 pages
ADI Hashing
No ratings yet
ADI Hashing
47 pages
L5 HashTables
No ratings yet
L5 HashTables
22 pages
Chapter05 1 PDF
No ratings yet
Chapter05 1 PDF
19 pages
Hashing
No ratings yet
Hashing
34 pages
FDMEE Training Final
100% (1)
FDMEE Training Final
77 pages
Handout 8 - Hashing
No ratings yet
Handout 8 - Hashing
9 pages
DBMS Unit-3 Notes
No ratings yet
DBMS Unit-3 Notes
9 pages
Hashing and Indexing
No ratings yet
Hashing and Indexing
28 pages
DSA MK Lect2 PDF
No ratings yet
DSA MK Lect2 PDF
92 pages
Hash Function Instruction Count
No ratings yet
Hash Function Instruction Count
6 pages
Hash Function
No ratings yet
Hash Function
9 pages
Hash Function - Wikipedia, The Free Encyclopedia
No ratings yet
Hash Function - Wikipedia, The Free Encyclopedia
5 pages
DB2 Problem Determination Using Db2top Utility
100% (3)
DB2 Problem Determination Using Db2top Utility
40 pages
Abstract - For - Supermarket Billing System
67% (3)
Abstract - For - Supermarket Billing System
3 pages
Image Processing With TensorFlow
100% (2)
Image Processing With TensorFlow
29 pages
LG r480 - Quanta Ql3 Preso-II - Rev 1a PDF
No ratings yet
LG r480 - Quanta Ql3 Preso-II - Rev 1a PDF
39 pages
Excel Formulas Examples
No ratings yet
Excel Formulas Examples
67 pages
Dbms
No ratings yet
Dbms
58 pages
Google Slides
No ratings yet
Google Slides
32 pages
Unit Ii
No ratings yet
Unit Ii
87 pages
Sem 4 PDF
No ratings yet
Sem 4 PDF
18 pages
Unit 1
No ratings yet
Unit 1
19 pages
CIS SUSE Linux Enterprise 11 Benchmark v2.1.0
No ratings yet
CIS SUSE Linux Enterprise 11 Benchmark v2.1.0
377 pages
Graph Traversal Tech
No ratings yet
Graph Traversal Tech
47 pages
Introduction To Graphs-Representation-Traversals
No ratings yet
Introduction To Graphs-Representation-Traversals
13 pages
Backtracking N Queens
No ratings yet
Backtracking N Queens
41 pages
Arista Cloudvision®: Cloud Automation For Everyone: White Paper
No ratings yet
Arista Cloudvision®: Cloud Automation For Everyone: White Paper
15 pages
Final Report Data Mining
No ratings yet
Final Report Data Mining
60 pages
Web Tech Notes Unit 3,4,5
No ratings yet
Web Tech Notes Unit 3,4,5
20 pages
Docu51004 Data Domain Operating System Command Reference Guide 5.4
No ratings yet
Docu51004 Data Domain Operating System Command Reference Guide 5.4
250 pages
Mech Nptel 2020-21
No ratings yet
Mech Nptel 2020-21
33 pages
Specifying FDT Technology: Putting Your Assets To Work!
No ratings yet
Specifying FDT Technology: Putting Your Assets To Work!
6 pages
External Fragmentation
No ratings yet
External Fragmentation
20 pages
Stanford Sample
No ratings yet
Stanford Sample
4 pages
SAS - Statistical Analysis System
No ratings yet
SAS - Statistical Analysis System
34 pages
CS Quiz 1
No ratings yet
CS Quiz 1
28 pages
MSP430fg4618 Lab Manual
No ratings yet
MSP430fg4618 Lab Manual
89 pages
ASAL IT Decisiontree Final
No ratings yet
ASAL IT Decisiontree Final
1 page
Variables, Expressions, and Statements
No ratings yet
Variables, Expressions, and Statements
12 pages
Section 1: Clinics and Departments
No ratings yet
Section 1: Clinics and Departments
6 pages
Notification 15.03.2017
No ratings yet
Notification 15.03.2017
33 pages
Order Management: Table Joins
No ratings yet
Order Management: Table Joins
5 pages
Concurrenthashmap in Java 8: Parallelismthreshold
No ratings yet
Concurrenthashmap in Java 8: Parallelismthreshold
5 pages
WWW - Sap Img - Com Sap Implementation
No ratings yet
WWW - Sap Img - Com Sap Implementation
3 pages
Methods of Qual-WPS Office
No ratings yet
Methods of Qual-WPS Office
2 pages
Director Program Management IT in Philadelphia PA Resume Edward Mayer
No ratings yet
Director Program Management IT in Philadelphia PA Resume Edward Mayer
2 pages
The Tech Interview Playbook: From DSA to System Design
From Everand
The Tech Interview Playbook: From DSA to System Design
Chinmoy Mukherjee
No ratings yet

Unit Iii

Uploaded by

Unit Iii

Uploaded by

UNIT –IV

HASHING AND SORTING

4.1 Hash Tables and Direct Address Tables

 Iteration order for hash tables by augmenting the structure

Time complexity and common uses of hash tables

2.8 Hash Function-Open Addressing

uint32 joaat_hash(uchar *key, size_t len)

for (i = 0; i < len; i++)

Hash collision resolved by linear probing (interval=1).

1.Linear probing in which the interval between probes is fixed—often at 1,

record pair { key, value }

function set(key, value)

// key_1,key_2,key_3,key_4 are following 3-digit numbers - parts of ip address xxx.xxx.xxx.xxx

Open addressing versus chaining

Ordered retrieval issue

Problems with hash tables

Other hash table algorithms

SELECT * from customer, orders where customer.cust_id = orders.cust_id and cust_id = X

a2 = old block number + ( N times 2 ^ (level) )

4.5 Bubble sort:

 Bubble is a simple and well-known sorting algorithm.

Bubble Sort program:

/***** Program to Sort an Array using Bubble Sort *****/

4.6 Selection sort:

The algorithm work as follows:

1. Set the first position as current position

Implementation Selection Sort :

4.7 Insertion Sort:

 Insertion sort algorithm somewhat resembles selection sort.

Example. Sort {7, -5, 2, 16, 4} using insertion sort.

Shifting instead of swapping

It is the most commonly used modification of the insertion sort.

Using binary search

Insert (Arr, N) where ARR is an array of N elements.

/**** Program to Sort an Array using Insertion Sort ****/

4.8 Shell sort

 Shell sort is an extension of insertion sort

Apply insertion sort to sort the three lists

Apply insertion sort on each sublist

Apply insertion sort to sort the list

The list is now sorted

Shell sort routine:

 Radix sort is a clever and intuitive little sorting algorithm.

Digit Number (submit)

 Now, the sub lists (number) are gathered in order from 0 to 9

 Consider the input sequence 31,41,59,26,53,58,97

Partition algorithm in detail

values after jth element are greater or equal to the pivot.

Why does it work?

4.12 Merge Sort

1. It is based on the divide and conquer approach

 Let us sort this unsorted list.

 Start merging the sublists to obtain a sorted list.

 Further merge the sublists.

 Again, merge the sublists.

all of them require the same amount of time.

MERGE SORT Program

/**** Program to Sort an Array using Merge Sort ****/

void merge_array(int a[], int beg, int mid, int end)

4.13 Searching- Linear search, Binary search

The item being searched for will be referred to as the target.

Algorithm : Linear search implementation

bool found = false;

int low = 0, high = size - 1;

while ( high >= low )

int mid = ( low + high ) / 2;

if ( key < list[mid] )

if ( key > list[mid] )

Binary Search: Illustration

Binary Search: Practical Issues

A couple of practical issues present themselves at this point:

You might also like

/* Program to Sort an Array using Bubble Sort */

/ Program to Sort an Array using Insertion Sort /

/ Program to Sort an Array using Merge Sort /