M4 - Chapter 7
• A space-for-time trade-off in computer science is a case where an algorithm or program trades
increased space usage for decreased running time, or vice versa.
• The idea is to preprocess the problem’s input, in whole or in part, and store the additional information
obtained to accelerate solving the problem afterward. This approach is called input enhancement.
• Example: 1) Counting methods for sorting.
2) Boyer-Moore algorithm for string matching.
• The other type of technique that exploits space-for-time trade-offs simply uses extra space to facilitate
faster and/or more flexible access to the data. We call this approach pre-structuring.
• Example: 1) hashing.
2) indexing with B-trees
• There is one more algorithm design technique related to the space-for-time trade-off idea: dynamic
programming. This strategy is based on recording solutions to overlapping subproblems of a given
problem in a table from which a solution to the problem in question is then obtained.
• Example: Fibonacci series, F(n) = F(n-1) + F(n-2)
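To make the idea concrete, here is a minimal Python sketch (function names are ours, not from the slides) contrasting the naive recursion, which recomputes overlapping subproblems, with a version that records subproblem solutions in a table:

def fib_naive(n):
    # Recomputes overlapping subproblems: exponential time.
    if n < 2:
        return n
    return fib_naive(n - 1) + fib_naive(n - 2)

def fib_dp(n):
    # Records solutions to subproblems in a table: linear time, linear space.
    table = [0, 1] + [0] * max(0, n - 1)
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]

print(fib_naive(10), fib_dp(10))  # 55 55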
1. Sorting by Counting
Example of sorting by comparison counting

Array A[0..5]         62  31  84  96  19  47
Initially Count[]      0   0   0   0   0   0
After pass i=0         3   0   1   1   0   0
After pass i=1         3   1   2   2   0   1
After pass i=2         3   1   4   3   0   1
After pass i=3         3   1   4   5   0   1
After pass i=4         3   1   4   5   0   2
Final state Count[]    3   1   4   5   0   2
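The trace above corresponds to the following minimal Python sketch (identifiers are ours):

def comparison_counting_sort(a):
    n = len(a)
    count = [0] * n
    # For each pair, increment the count of the larger element, so that
    # count[i] ends up equal to the number of elements that precede a[i]
    # in sorted order.
    for i in range(n - 1):
        for j in range(i + 1, n):
            if a[i] < a[j]:
                count[j] += 1
            else:
                count[i] += 1
    s = [None] * n
    for i in range(n):
        s[count[i]] = a[i]  # count[i] is a[i]'s final position
    return s

print(comparison_counting_sort([62, 31, 84, 96, 19, 47]))
# [19, 31, 47, 62, 84, 96]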
What is the time efficiency of this algorithm?
• It should be quadratic because the algorithm considers all the different pairs of an
n-element array. More formally, the number of times its basic operation, the
comparison A[i] < A[j], is executed is equal to the sum we have encountered
several times already:
C(n) = Σ(i=0..n−2) Σ(j=i+1..n−1) 1 = n(n − 1)/2 ∈ Θ(n²).
➢ Distribution counting
• Let us consider a more realistic situation of sorting a list of items with some other information associated with
their keys, so that we cannot overwrite the list’s elements.
• Then we can copy the elements into a new array S[0..n − 1] that will hold the sorted list, as follows. Let
F[0..u − l] hold the frequency of each value between the lowest possible value l and the highest possible value u.
The elements of A whose values equal l are copied into the first F[0] elements of S, i.e., positions 0 through
F[0] − 1; the elements of value l + 1 are copied to positions F[0] through (F[0] + F[1]) − 1; and so on.
• Since such accumulated sums of frequencies are called a distribution in statistics, the method itself is known
as distribution counting.
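A minimal Python sketch of distribution counting, assuming integer values in a known range [l, u] (identifiers are ours):

def distribution_counting_sort(a, l, u):
    # First pass: compute the frequency of each value in [l, u].
    f = [0] * (u - l + 1)
    for v in a:
        f[v - l] += 1
    # Turn the frequencies into a distribution: f[j] = number of elements <= l + j.
    for j in range(1, len(f)):
        f[j] += f[j - 1]
    # Second pass: scan a right to left, placing each element in its final slot.
    s = [None] * len(a)
    for v in reversed(a):
        f[v - l] -= 1
        s[f[v - l]] = v
    return s

print(distribution_counting_sort([13, 11, 12, 13, 12, 12], 11, 13))
# [11, 12, 12, 12, 13, 13]

Scanning the input right to left keeps equal keys in their original relative order, which is what lets the information associated with the keys travel with them.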
Assuming that the range of array values is fixed, this is obviously a linear algorithm because it makes just two consecutive
passes through its input array A. This is a better time-efficiency class than that of the most efficient sorting algorithms—
mergesort, quicksort, and heapsort—we have encountered. It is important to remember, however, that this efficiency is
obtained by exploiting the specific nature of inputs for which sorting by distribution counting works, in addition to trading
space for time.
Tutorial:
1. Sort the given elements using sorting by counting method.
a. 50, 60, 70, 40, 30, 20, 10, 80, 100
b. 21, 26, 30, 9, 4, 14, 28, 18, 15, 10, 2, 3, 7
c. 14, 17, 11, 7, 53, 4, 13, 12, 8, 60, 19, 16, 20
2. Sort the given elements using distribution counting
a. 10, 8, 10, 5, 8, 10, 8
b. 1, 2, 1, 2, 3, 1, 1, 5, 3, 2, 1
https://www.youtube.com/watch?v=W4h6555g5qo
2. Input Enhancement in String Matching
a. Horspool’s Algorithm
• In Horspool’s algorithm, a shift table has to be constructed that contains, for each character of the alphabet,
the number of positions to shift the pattern to the right.
• In general, the following four possibilities can occur for the text character c aligned against the last
character of the pattern: (1) c does not occur in the pattern at all, so the pattern can safely be shifted by its
entire length m; (2) c occurs in the pattern but is not its last character, so the shift aligns the rightmost
occurrence of c in the pattern with c in the text; (3) c matches the last character of the pattern but occurs
nowhere else in it, so the pattern can be shifted by m; (4) c matches the last character of the pattern and also
occurs among its first m − 1 characters, so the shift aligns the rightmost of those occurrences with c in the text.
• These examples clearly demonstrate that right-to-left character comparisons can lead to
farther shifts of the pattern than the shifts by only one position always made by the
brute-force algorithm.
Horspool’s algorithm
Step 1 For a given pattern of length m and the alphabet used in both the
pattern and text, construct the shift table as described above.
Step 2 Align the pattern against the beginning of the text.
Step 3 Repeat the following until either a matching substring is found or the
pattern reaches beyond the last character of the text. Starting with the last
character in the pattern, compare the corresponding characters in the pattern
and text until either all m characters are matched (then stop) or a
mismatching pair is encountered. In the latter case, retrieve the entry t(c)
from column c of the shift table, where c is the text’s character currently
aligned against the last character of the pattern, and shift the pattern by t(c)
characters to the right along the text.
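A minimal Python sketch of these three steps (identifiers are ours):

def shift_table(pattern):
    m = len(pattern)
    table = {}  # characters absent from the table get the default shift m
    # Scan the first m - 1 characters; a later occurrence overwrites an
    # earlier one, leaving the distance from the rightmost occurrence of
    # each character to the last character of the pattern.
    for j in range(m - 1):
        table[pattern[j]] = m - 1 - j
    return table

def horspool(pattern, text):
    m, n = len(pattern), len(text)
    t = shift_table(pattern)
    i = m - 1  # index of the text character aligned with the pattern's last character
    while i <= n - 1:
        k = 0  # number of characters matched so far, right to left
        while k <= m - 1 and pattern[m - 1 - k] == text[i - k]:
            k += 1
        if k == m:
            return i - m + 1  # match found; return its starting index
        i += t.get(text[i], m)  # shift by t(c) for the aligned text character
    return -1  # pattern moved past the last character: no match

print(horspool("ENGINEER", "COMPUTER SCIENCE ENGINEERING"))  # 17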
Example:
• Consider the string COMPUTER SCIENCE ENGINEERING and search for the pattern ENGINEER in this string using
Horspool’s algorithm.
• The shift table for ENGINEER is:

Character   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
Shift t(c)  8 8 8 8 1 8 5 8 4 8 8 8 8 3 8 8 8 8 8 8 8 8 8 8 8 8

(Every entry starts at m = 8; scanning the first m − 1 characters of the pattern left to right overwrites the
entries, so E receives 7, then 2, then 1, and N receives 6, then 3.)
The basic idea of Horspool’s algorithm is to shift the pattern based on the character of the given string
currently aligned against the last character of the pattern.
C O M P U T E R   S C I E N C E   E N G I N E E R I N G
E N G I N E E R
                E N G I N E E R
                  E N G I N E E R
                                  E N G I N E E R
Note: in the shift table, the entry for a character occurring among the first m − 1 characters of the pattern is
the number of characters from its rightmost such occurrence to the last character of the pattern.
• A simple example can demonstrate that the worst-case efficiency of Horspool’s
algorithm is in O(nm). But for random texts, it is in O(n), and, although in the same
efficiency class, Horspool’s algorithm is obviously faster on average than the brute-
force algorithm.
(Figure: the shift-table construction algorithm; the table is indexed by the alphabet’s characters, so its size
equals the length of the alphabet.)
https://www.youtube.com/watch?v=JITD8C2wLQY
b. Boyer-Moore Algorithm
Step 1 For a given pattern and the alphabet used in both the pattern and the text, construct the bad-symbol
shift table.
Step 2 Using the pattern, construct the good-suffix shift table.
Step 3 Align the pattern against the beginning of the text.
Step 4 Repeat the following step until either a matching substring is found or the pattern reaches beyond the
last character of the text. Starting with the last character in the pattern, compare the corresponding characters
in the pattern and the text until either all m character pairs are matched (then stop) or a mismatching pair is
encountered after k ≥ 0 character pairs are matched successfully. In the latter case, retrieve the entry t1(c)
from column c of the bad-symbol table, where c is the text’s mismatched character. If k > 0, also retrieve the
corresponding d2 entry from the good-suffix table. Shift the pattern to the right by the number of positions
computed by the formula

d = d1 if k = 0, and d = max(d1, d2) if k > 0, where d1 = max(t1(c) − k, 1).

When searching for the first occurrence of the pattern, the worst-case efficiency of the Boyer-Moore
algorithm is known to be linear (O(n)). Though this algorithm runs very fast, especially on large alphabets
(relative to the length of the pattern), many people prefer its simplified versions, such as Horspool’s
algorithm, when dealing with natural-language-like strings (applications like spelling and grammar checkers,
spam detectors, translation, and sentiment analysis).
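As a sketch of just the Step 4 shift rule: the bad-symbol table t1 is the same table Horspool’s algorithm uses, and the good-suffix table d2, indexed by the number k of matched characters, is assumed to be precomputed (its construction is not shown here). Identifiers are ours:

def boyer_moore_shift(t1, d2, c, k, m):
    # Bad-symbol shift: never less than 1, even when t1(c) - k is small.
    d1 = max(t1.get(c, m) - k, 1)
    if k == 0:
        return d1
    # With a matched suffix of length k > 0, take the larger of the two shifts.
    return max(d1, d2[k])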
Tutorial:
3. Hashing
• Typically, records comprise several fields, each responsible for
keeping a particular type of information about an entity the
record represents. For example, a student record may contain
fields for the student’s ID, name, date of birth, home address,
major, and so on. Among record fields there is usually at least one
called a key that is used for identifying entities represented by the
records (e.g., the student’s ID). In the discussion below, we
assume that we have to implement a dictionary of n records with
keys K1, K2,...,Kn.
• Hashing is the transformation of a string of characters into a
usually shorter, fixed-length value or key that represents the
original string. Hashing is used to index and retrieve items in a
database because it is faster to find an item using the short
hashed key than to find it using the original value.
• Hashing is based on the idea of distributing keys among a one-
dimensional array H[0..m − 1] called a hash table. The
distribution is done by computing, for each of the keys, the value
of some predefined function h called the hash function. This
function assigns an integer between 0 and m − 1, called the hash
address, to a key.
Two principal versions of hashing:
• Open hashing (also called separate chaining)
• Closed hashing (also called open addressing), whose simplest collision-resolution strategy is linear probing

1. Open hashing (also called separate chaining)
In open hashing, keys are stored in linked lists attached to cells of a hash table. Each list contains all the keys
hashed to its cell. Consider, as an example, the following list of words: A, FOOL, AND, HIS, MONEY,
ARE, SOON, PARTED. As a hash function, we will use the simple function for strings mentioned above, i.e.,
we will add the positions of a word’s letters in the alphabet and compute the sum’s remainder after division by
13. We start with the empty table. The first key is the word A; its hash value is h(A) = 1 mod 13 = 1. The
second key—the word FOOL—is installed in the ninth cell since (6 + 15 + 15 + 12) mod 13 = 9, and so on.
There is a collision of the keys ARE and SOON because h(ARE) = (1 + 18 + 5) mod 13 = 11 and h(SOON) =
(19 + 15 + 15 + 14) mod 13 = 11.
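A small Python sketch reproducing this example (identifiers are ours):

def h(word, m=13):
    # Sum of the letters' positions in the alphabet, modulo the table size.
    return sum(ord(c) - ord('A') + 1 for c in word) % m

def build_chained_table(keys, m=13):
    table = [[] for _ in range(m)]
    for key in keys:
        table[h(key, m)].append(key)  # colliding keys share one list
    return table

words = ["A", "FOOL", "AND", "HIS", "MONEY", "ARE", "SOON", "PARTED"]
table = build_chained_table(words)
print(h("ARE"), h("SOON"))  # 11 11 -- the collision noted above
print(table[11])            # ['ARE', 'SOON']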
How do we search in a dictionary implemented as such a table of linked lists?
• We do this by simply applying to a search key the same procedure that was used for creating the table. To
illustrate, if we want to search for the key KID in the hash table, we first compute the value of the same hash
function for the key: h(KID) = 11. Since the list attached to cell 11 is not empty, its linked list may contain
the search key. But because of possible collisions, we cannot tell whether this is the case until we traverse
this linked list. After comparing the string KID first with the string ARE and then with the string SOON, we
end up with an unsuccessful search.
• In general, the efficiency of searching depends on the lengths of the linked lists, which, in turn, depend on
the dictionary and table sizes, as well as the quality of the hash function. If the hash function distributes ‘n’
keys among ‘m’ cells of the hash table about evenly, each list will be about ‘n/m’ keys long. The ratio α =
n/m, called the load factor of the hash table, plays a crucial role in the efficiency of hashing. In particular,
the average number of pointers (chain links) inspected in successful searches, S, and unsuccessful
searches, U, turns out to be
S ≈ 1 + α/2 and U = α,
respectively, under the standard assumptions of searching for a randomly selected element and a hash
function distributing keys uniformly among the table’s cells.
• The two other dictionary operations—insertion and deletion—are almost identical to searching. Insertions
are normally done at the end of a list. Deletion is performed by searching for a key to be deleted and then
removing it from its list. Hence, the efficiency of these operations is identical to that of searching, and they
are all Θ(1) in the average case if the number of keys ‘n’ is about equal to the hash table’s size ‘m’.
2. Closed hashing (also called open addressing)
• In closed hashing, all keys are stored in the hash table itself without the use of linked lists. (Of
course, this implies that the table size m must be at least as large as the number of keys n.)
Different strategies can be employed for collision resolution. The simplest one—called linear
probing—checks the cell following the one where the collision occurs. If that cell is empty, the
new key is installed there; if the next cell is already occupied, the availability of that cell’s
immediate successor is checked, and so on. Note that if the end of the hash table is reached, the
search is wrapped to the beginning of the table; i.e., it is treated as a circular array.
To search for a given key K, we start by computing h(K) where h is the hash function used in the table
construction. If the cell h(K) is empty, the search is unsuccessful. If the cell is not empty, we must compare K
with the cell’s occupant: if they are equal, we have found a matching key; if they are not, we compare K with
a key in the next cell and continue in this manner until we encounter either a matching key (a successful
search) or an empty cell (unsuccessful search). For example, if we search for the word LIT in the above
table, we will get h(LIT) = (12 + 9 + 20) mod 13 = 2 and, since cell 2 is empty, we can stop immediately.
However, if we search for KID with h(KID) = (11 + 9 + 4) mod 13 = 11, we will have to compare KID with
ARE, SOON, PARTED, and A before we can declare the search unsuccessful.
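A minimal Python sketch of linear-probing insertion and search, treating the table as a circular array (identifiers are ours; insertion assumes the table is not full, and h is the letter-position hash from the chaining sketch above):

def h(word, m=13):
    return sum(ord(c) - ord('A') + 1 for c in word) % m

def lp_insert(table, key):
    i = h(key, len(table))
    while table[i] is not None:   # probe successive cells,
        i = (i + 1) % len(table)  # wrapping around at the end
    table[i] = key

def lp_search(table, key):
    i = h(key, len(table))
    while table[i] is not None:
        if table[i] == key:
            return i              # successful search
        i = (i + 1) % len(table)
    return -1                     # empty cell reached: unsuccessful search

table = [None] * 13
for w in ["A", "FOOL", "AND", "HIS", "MONEY", "ARE", "SOON", "PARTED"]:
    lp_insert(table, w)
print(lp_search(table, "KID"))  # -1, after comparing with ARE, SOON, PARTED, A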
• Although the search and insertion operations are straightforward for this version of hashing, deletion is
not. For example, if we simply delete the key ARE from the last state of the hash table in the above figure,
we will be unable to find the key SOON afterward. Indeed, after computing h(SOON) = 11, the algorithm
would find this location empty and report the unsuccessful search result. A simple solution is to use “lazy
deletion,” i.e., to mark previously occupied locations by a special symbol to distinguish them from
locations that have never been occupied; in other words, deletions are done by marking an element as
deleted rather than erasing it entirely.
• The mathematical analysis of linear probing is a much more difficult problem than that of separate
chaining. The simplified versions of these results state that the average number of times the algorithm must
access the hash table with the load factor α in successful and unsuccessful searches is, respectively,
S ≈ (1/2)(1 + 1/(1 − α)) and U ≈ (1/2)(1 + 1/(1 − α)²)
(and the accuracy of these approximations increases with larger sizes of the hash table). These numbers are
surprisingly small even for densely populated tables, i.e., for large percentage values of α:

α      S      U
50%    1.5    2.5
75%    2.5    8.5
90%    5.5    50.5
• A cluster in linear probing is a sequence of contiguously occupied cells (with a possible wrapping).
For example, the final state of the above hash table has two clusters. Clusters are bad news in hashing
because they make the dictionary operations less efficient. As clusters become larger, the probability
that a new element will be attached to a cluster increases; in addition, large clusters increase the
probability that two clusters will combine after a new key’s insertion, causing even more clustering.
• Several other collision resolution strategies have been suggested to reduce this problem. One of the
most important is double hashing. Under this scheme, we use another hash function, s(K), to
determine a fixed increment for the probing sequence to be used after a collision at location l = h(K):
(l + s(K)) mod m, (l + 2s(K)) mod m, . . . .
• To guarantee that every location in the table is probed by the above sequence, the increment s(k) and
the table size m must be relatively prime, i.e., their only common divisor must be 1. (This condition is
satisfied automatically if m itself is prime.) Some functions recommended in the literature are s(k) = m
− 2 − k mod (m − 2) and s(k) = 8 − (k mod 8) for small tables and s(k) = k mod 97 + 1 for larger ones.
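A brief Python sketch of the resulting probe sequence, using the first secondary function suggested above and assuming m is prime (identifiers are ours):

def double_hash_probes(k, m):
    # Primary location and fixed increment; since m is prime,
    # gcd(s, m) = 1, so the sequence visits every cell.
    l = k % m
    s = m - 2 - k % (m - 2)
    return [(l + i * s) % m for i in range(m)]

print(double_hash_probes(24, 13))
# [11, 7, 3, 12, 8, 4, 0, 9, 5, 1, 10, 6, 2] -- all 13 cells probed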
• Mathematical analysis of double hashing has proved to be quite difficult. Some partial results and
considerable practical experience with the method suggest that with good hashing functions—both
primary and secondary—double hashing is superior to linear probing. But its performance also fails
when the table gets close to being full. A natural solution in such a situation is rehashing: the current
table is scanned, and all its keys are relocated into a larger table.
It is worthwhile to compare the main properties of hashing with balanced search
trees—its principal competitor for implementing dictionaries.
• Asymptotic time efficiency: With hashing, searching, insertion, and deletion can
be implemented to take Θ(1) time on the average but Θ(n) time in the very
unlikely worst case. For balanced search trees, the time efficiency is
Θ(log n) in both the average and worst cases.
• Ordering preservation: Unlike balanced search trees, hashing does not assume
the existence of key ordering and usually does not preserve it. This makes hashing
less suitable for applications that need to iterate over the keys in order or require
range queries, such as counting the number of keys between some lower and
upper bounds.
By:
Dr. Geeta S Hukkeri