Strings: Steven Skiena
edu/skiena
Character Codes
Character codes are mappings between numbers and the symbols that make up a particular alphabet. The American Standard Code for Information Interchange (ASCII) is a single-byte character code in which 2^7 = 128 characters are specified. Bytes are eight-bit entities, so the highest-order bit is left as zero.
  0 NUL    8 BS    16 DLE   24 CAN   32 SP    40 (     48 0     56 8     64 @     72 H     80 P     88 X     96 `    104 h    112 p    120 x
  1 SOH    9 HT    17 DC1   25 EM    33 !     41 )     49 1     57 9     65 A     73 I     81 Q     89 Y     97 a    105 i    113 q    121 y
  2 STX   10 NL    18 DC2   26 SUB   34 "     42 *     50 2     58 :     66 B     74 J     82 R     90 Z     98 b    106 j    114 r    122 z
  3 ETX   11 VT    19 DC3   27 ESC   35 #     43 +     51 3     59 ;     67 C     75 K     83 S     91 [     99 c    107 k    115 s    123 {
  4 EOT   12 NP    20 DC4   28 FS    36 $     44 ,     52 4     60 <     68 D     76 L     84 T     92 \    100 d    108 l    116 t    124 |
  5 ENQ   13 CR    21 NAK   29 GS    37 %     45 -     53 5     61 =     69 E     77 M     85 U     93 ]    101 e    109 m    117 u    125 }
  6 ACK   14 SO    22 SYN   30 RS    38 &     46 .     54 6     62 >     70 F     78 N     86 V     94 ^    102 f    110 n    118 v    126 ~
  7 BEL   15 SI    23 ETB   31 US    39 '     47 /     55 7     63 ?     71 G     79 O     87 W     95 _    103 g    111 o    119 w    127 DEL
Properties of ASCII
Several properties of the design make programming tasks easier. All non-printable characters have either the first three bits as zero or all seven lowest bits as one. This makes it very easy to eliminate them before displaying junk. Both the upper- and lowercase letters and the numerical digits appear sequentially. Thus we can iterate through all the letters or digits simply by looping from the value of the first symbol (say, 'a') to the value of the last symbol (say, 'z').
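The two properties above might be sketched in C++ as follows. The function names are illustrative and not from the original slides; in practice the standard `isprint` from `<cctype>` does a similar job. The sketch assumes an ASCII execution character set.

```cpp
#include <string>

// True if c is a printable 7-bit ASCII character (codes 32..126).
// Control characters have the top three bits of the byte clear (codes 0..31);
// DEL (code 127) has all seven low bits set.
bool is_printable_ascii(unsigned char c) {
    if (c > 127) return false;             // outside 7-bit ASCII
    if ((c & 0xE0) == 0) return false;     // codes 0..31
    if ((c & 0x7F) == 0x7F) return false;  // code 127 (DEL)
    return true;
}

// Copy s, dropping every byte that is not printable ASCII.
std::string strip_nonprintable(const std::string& s) {
    std::string out;
    for (unsigned char c : s)
        if (is_printable_ascii(c)) out += static_cast<char>(c);
    return out;
}

// Build "abc...z" by looping from the first letter's code to the last's.
std::string lowercase_alphabet() {
    std::string letters;
    for (char c = 'a'; c <= 'z'; ++c) letters += c;
    return letters;
}
```

Note that tab and new-line are control codes too, so `strip_nonprintable` removes them along with everything else below 32.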
We can convert a character (say, 'I') to its rank in the collating sequence (eighth, if 'A' is the zeroth character) simply by subtracting off the first symbol ('A'). We can convert a character (say, 'C') from upper- to lowercase by adding the difference between the lowercase and uppercase starting characters ('C' - 'A' + 'a'). Similarly, a character x is uppercase if and only if it lies between 'A' and 'Z'. The character code tells us what will happen when naively sorting text files. Which of 'x', '3', or 'C' appears first in alphabetical order? Sorting alphabetically means sorting by character code. Using a different collating sequence requires more complicated comparison functions.
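The arithmetic described here can be tried out with a few one-line helpers. This is a minimal sketch assuming an ASCII execution character set; the helper names are hypothetical, and like the conversions in the text they do no error checking.

```cpp
#include <algorithm>
#include <string>

// Each function trusts the caller to pass a character of the right kind.
bool is_upper(char c) { return c >= 'A' && c <= 'Z'; }

// Rank in the collating sequence, counting 'A' as zero.
int rank_of_upper(char c) { return c - 'A'; }

// Shift an uppercase letter by the distance between 'A' and 'a'.
char to_lower_ascii(char c) { return static_cast<char>(c - 'A' + 'a'); }

// Sort the characters of s by character code.
std::string sort_by_code(std::string s) {
    std::sort(s.begin(), s.end());
    return s;
}
```

For example, `rank_of_upper('I')` is 8 and `to_lower_ascii('C')` is 'c'. Calling `sort_by_code("xC3")` gives "3Cx", because in ASCII the digits precede the uppercase letters, which precede the lowercase letters.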
Non-printable character codes for new-line (10) and carriage return (13) are designed to delimit the end of text lines. Inconsistent use of these codes is one of the pains in moving text files between UNIX and Windows systems.
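As an illustration of the inconsistency: UNIX text files conventionally end each line with new-line alone, while Windows text files use carriage return followed by new-line. A small normalizing routine, sketched here under the hypothetical name `crlf_to_lf`, is one way to cope when moving files from Windows to UNIX.

```cpp
#include <cstddef>
#include <string>

// Return a copy of text in which every CR NL pair (codes 13, 10) is replaced
// by a single NL. A lone CR is left untouched.
std::string crlf_to_lf(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (text[i] == '\r' && i + 1 < text.size() && text[i + 1] == '\n')
            continue;  // skip the CR; the NL is copied on the next iteration
        out += text[i];
    }
    return out;
}
```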
Unicode
More modern international character code designs such as Unicode use two or even three bytes per symbol, and can represent virtually any symbol in every language on earth. Older languages, like Pascal, C, and C++, view the char type as virtually synonymous with 8-bit entities. However, good old ASCII remains alive embedded in Unicode. Java, on the other hand, was designed to support Unicode, so characters are 16-bit entities. The upper byte is all zeros when working with ASCII/ISO Latin 1 text.
Representing Strings
Strings are sequences of characters, where order clearly matters. It is important to be aware of how your favorite programming language represents strings, because there are several different possibilities:
Null-terminated Arrays: C/C++ treats strings as arrays of characters. The string ends the instant it hits the null character \0, i.e., the character with ASCII code zero. Failing to end your string explicitly with a null typically extends it by a bunch of unprintable characters.
Array Plus Length: Another scheme uses the first array location to store the length of the string, thus avoiding the need for any terminating null character. Presumably this is what Java implementations do internally.
Linked Lists of Characters: Text strings can be represented using linked lists, but this is typically avoided because of the high space overhead of having a several-byte pointer for each single-byte character.
Which allow efficient checks that the ith character is in fact within the string, thus avoiding out-of-bounds errors?
Which allow efficient deletion or insertion of new characters at the ith position?
Which representation is used when users are limited to strings of length at most 255, e.g., file names in Windows?
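One way to explore the first and third questions is to put two of the representations side by side. The sketch below is illustrative only: the counted layout (length in byte 0, characters after it) and all function names are assumptions, not a description of any particular language implementation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Null-terminated: the length is not stored, so finding it costs a scan.
std::size_t cstring_length(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;
    return n;
}

// Array plus length (hypothetical layout): byte 0 holds the length,
// bytes 1..length hold the characters. A one-byte length caps strings at 255.
std::vector<unsigned char> make_counted(const std::string& s) {
    std::vector<unsigned char> buf;
    std::size_t n = s.size() < 255 ? s.size() : 255;  // truncate to fit
    buf.push_back(static_cast<unsigned char>(n));
    for (std::size_t i = 0; i < n; ++i)
        buf.push_back(static_cast<unsigned char>(s[i]));
    return buf;
}

// Bounds check in constant time: compare i against the stored length.
bool counted_has_index(const std::vector<unsigned char>& buf, std::size_t i) {
    return !buf.empty() && i < buf[0];
}
```

With the counted form, checking whether index i is valid is a single comparison. With the null-terminated form, the same check needs a scan of up to i characters looking for the terminator.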
Note that this routine only searches for exact pattern matches. If a letter is capitalized in the pattern but not in the text, there is no match. This algorithm runs in O(|p| |q|) time. More complicated but efficient linear-time algorithms exist for substring pattern matching.
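The routine this note refers to did not survive in this copy of the slides. As a stand-in, here is a minimal sketch of a brute-force exact matcher with the quadratic behavior described; the name `find_match` and its signature are assumptions, not the author's code.

```cpp
#include <cstddef>
#include <string>

// Return the position of the first occurrence of pattern in text, or -1 if
// there is none. Every starting position is tried and compared character by
// character, so the worst case is proportional to |pattern| * |text|.
int find_match(const std::string& pattern, const std::string& text) {
    const std::size_t m = pattern.size();
    const std::size_t n = text.size();
    if (m > n) return -1;
    for (std::size_t i = 0; i + m <= n; ++i) {
        std::size_t j = 0;
        while (j < m && text[i + j] == pattern[j]) ++j;
        if (j == m) return static_cast<int>(i);  // all m characters matched
    }
    return -1;
}
```

The comparison is case-sensitive: `find_match("lo", "hello")` returns 3, while `find_match("Lo", "hello")` returns -1.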
int toupper(int c); /* convert c to upper case -- no error checking */
int tolower(int c); /* convert c to lower case -- no error checking */
string::operator[](size_type i)
string::append(s) /* append to string */
string::erase(n,m) /* delete a run of characters */
string::insert(size_type n, const string &s) /* insert string s at n */
string::find(s)
string::rfind(s)
string::first()
string::last()
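A short sketch of some of these operations on `std::string` follows. It uses only `operator[]`, `append`, `erase`, `insert`, `find`, and `rfind`; the wrapper names `edit_demo`, `first_pos`, and `last_pos` are hypothetical. Note that in `erase(n, m)` the second argument is a count of characters, not an end position.

```cpp
#include <cstddef>
#include <string>

// Exercise a few std::string members on a small example.
std::string edit_demo() {
    std::string s = "string";
    s.append(" ops");    // "string ops"
    s.erase(0, 3);       // remove 3 characters starting at index 0: "ing ops"
    s.insert(0, "str");  // "string ops"
    s[0] = 'S';          // "String ops"
    return s;
}

// find scans from the front, rfind from the back; both return
// std::string::npos when the substring is absent.
std::size_t first_pos(const std::string& s, const std::string& sub) {
    return s.find(sub);
}

std::size_t last_pos(const std::string& s, const std::string& sub) {
    return s.rfind(sub);
}
```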
110307 (Doublets)
Build word ladders on a dictionary of strings. How do we represent and traverse the underlying graph? (if necessary, look ahead to Chapter 9)
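One possible answer, offered as a sketch and not as the intended solution: treat each dictionary word as a vertex, connect two words when they have equal length and differ in exactly one position, and traverse with breadth-first search. The code below assumes that adjacency rule and tests edges on the fly instead of storing them; the names `one_letter_apart` and `ladder` are hypothetical, and both indices must be valid positions in the dictionary.

```cpp
#include <algorithm>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Two words are adjacent if they have equal length and differ in one position.
bool one_letter_apart(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) return false;
    int diff = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i] && ++diff > 1) return false;
    return diff == 1;
}

// Breadth-first search over dictionary indices. Returns a shortest ladder
// from words[src] to words[dst], or an empty vector if none exists.
std::vector<std::string> ladder(const std::vector<std::string>& words,
                                std::size_t src, std::size_t dst) {
    const std::size_t n = words.size();
    std::vector<int> parent(n, -1);
    std::vector<bool> seen(n, false);
    std::queue<std::size_t> q;
    seen[src] = true;
    q.push(src);
    while (!q.empty()) {
        std::size_t u = q.front();
        q.pop();
        if (u == dst) break;
        for (std::size_t v = 0; v < n; ++v) {
            if (!seen[v] && one_letter_apart(words[u], words[v])) {
                seen[v] = true;
                parent[v] = static_cast<int>(u);
                q.push(v);
            }
        }
    }
    if (!seen[dst]) return {};
    std::vector<std::string> path;  // walk parent links back to the source
    for (int v = static_cast<int>(dst); v != -1; v = parent[v])
        path.push_back(words[v]);
    std::reverse(path.begin(), path.end());
    return path;
}
```

Testing every pair of words costs time quadratic in the dictionary size per search. For a large dictionary it would likely pay to precompute adjacency lists, for instance by grouping words by length first.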