STRING DATA TYPE
PRESENTED BY: CHRISTOPHER HODOH
• A string is generally considered a data type and is often implemented as
an array data structure of bytes (or words) that stores a sequence of
elements, typically characters, using some character encoding. The term
string may also denote more general arrays or other sequence (or list) data
types and structures.
• A string datatype is a datatype modeled on the idea of a formal string. Strings are
such an important and useful datatype that they are implemented in nearly every
programming language. In some languages they are available as primitive types and
in others as composite types. The syntax of most high-level programming languages
allows for a string, usually quoted in some way, to represent an instance of a string
datatype; such a meta-string is called a literal or string literal.
STRING LENGTH
Although formal strings can have an arbitrary finite length, the length of strings in real languages is often
constrained to an artificial maximum. In general, there are two types of string datatypes: fixed-length strings,
which have a fixed maximum length to be determined at compile time and which use the same amount of
memory whether this maximum is needed or not, and variable-length strings, whose length is not arbitrarily
fixed and which can use varying amounts of memory depending on the actual requirements at run time.
Most strings in modern programming languages are variable-length strings. Of course, even variable-length
strings are limited in length by the size of available computer memory. The string length can be stored as
a separate integer (which may put another artificial limit on the length) or implicitly through a termination
character, usually a character value with all bits zero, as in the C programming language.
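The two ways of storing the length can be sketched in Python. This is only an illustration: the helper names `to_length_prefixed` and `to_null_terminated` and the choice of a 1-byte length field are assumptions made for the example, not something a particular language prescribes.

```python
import struct


def to_length_prefixed(text: str) -> bytes:
    """Store the length as a separate 1-byte integer, followed by the characters."""
    data = text.encode("ascii")
    if len(data) > 255:
        # The width of the length field is itself an artificial limit.
        raise ValueError("string too long for a 1-byte length field")
    return struct.pack("B", len(data)) + data


def to_null_terminated(text: str) -> bytes:
    """Store the length implicitly by appending an all-bits-zero byte."""
    data = text.encode("ascii")
    if b"\x00" in data:
        raise ValueError("the terminator byte cannot appear inside the string")
    return data + b"\x00"


prefixed = to_length_prefixed("hello")    # length byte 5, then the 5 characters
terminated = to_null_terminated("hello")  # the 5 characters, then a zero byte
```

Both layouts use one extra byte here, but the length-prefixed form caps the string at 255 characters, while the terminated form has no cap and instead forbids the zero byte inside the text.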
CHARACTER ENCODING
• String datatypes have historically allocated one byte per character, and,
although the exact character set varied by region, character encodings
were similar enough that programmers could often get away with ignoring
this, since characters a program treated specially (such as period and space
and comma) were in the same place in all the encodings a program would
encounter. These character sets were typically based on ASCII or EBCDIC. If
text in one encoding was displayed on a system using a different encoding,
the text was often mangled, though usually still somewhat readable, and some
computer users learned to read the mangled text.
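The mangling effect can be reproduced with a small Python sketch. The two encodings used here, ISO 8859-1 (Latin-1) and IBM code page 437, are chosen only as examples of one-byte-per-character sets that extend ASCII; the sample sentence is likewise an assumption.

```python
text = "caf\u00e9, se\u00f1or."  # "café, señor."

# Written on a system that uses Latin-1: one byte per character.
raw = text.encode("latin-1")

# Read on a system that assumes code page 437 instead.
mangled = raw.decode("cp437")

# Characters in the shared ASCII range (letters, comma, space, period)
# sit at the same byte values in both encodings, so they survive.
ascii_preserved = all(a == b for a, b in zip(text, mangled) if a.isascii())

print(text)
print(mangled)
```

Only the two accented letters come out wrong, which is why such text often stayed readable.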
IMPLEMENTATIONS
• Some languages, such as C++ and Ruby, normally allow the contents of a
string to be changed after it has been created; these are
termed mutable strings. In other languages, such as Java and Python, the
value is fixed and a new string must be created if any alteration is to be
made; these are termed immutable strings (some of these languages also
provide another type that is mutable, as in Java and .NET).
• Strings are typically implemented as arrays of bytes, characters, or code
units, in order to allow fast access to individual units or substrings—including
characters when they have a fixed length. A few languages such as Haskell
implement them as linked lists instead. Some languages, such as Prolog and
Erlang, avoid implementing a dedicated string datatype at all, instead
adopting the convention of representing strings as lists of character codes.
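A short Python sketch of these points, assuming a standard Python 3 interpreter. Python's `str` is immutable; `bytearray` is used here as a stand-in for a separate mutable type (it holds bytes rather than text), and the list of character codes only imitates the Prolog/Erlang convention.

```python
s = "string"

# Immutable string: item assignment is rejected with a TypeError.
try:
    s[0] = "S"
    changed_in_place = True
except TypeError:
    changed_in_place = False

# An alteration therefore has to build a new string.
t = "S" + s[1:]

# A separate mutable type: the contents can be changed after creation.
buf = bytearray(b"string")
buf[0] = ord("S")

# Array-style storage gives direct access to single units and substrings.
third = s[2]
middle = s[1:4]

# The same text represented as a list of character codes.
codes = [ord(ch) for ch in s]
```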
REPRESENTATIONS
• Representations of strings depend heavily on the choice of character
repertoire and the method of character encoding. Older string
implementations were designed to work with the repertoire and encoding
defined by ASCII. Modern implementations often use the extensive
repertoire defined by Unicode along with a variety of complex
encodings such as UTF-8 and UTF-16.
• The term byte string usually indicates a general-purpose string of
bytes, rather than strings of only (readable) characters, strings of bits,
or such. Byte strings often imply that bytes can take any value and any
data can be stored as-is, meaning that there should be no value
interpreted as a termination value.
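As a rough illustration, the Python sketch below encodes one piece of text in UTF-8 and UTF-16 and then builds a byte string that holds arbitrary values. The sample text and the variable names are assumptions made for the example.

```python
text = "na\u00efve \u20ac"  # "naïve €": 7 characters

utf8 = text.encode("utf-8")       # 1 to 3 bytes per character for this text
utf16 = text.encode("utf-16-le")  # 2 bytes per character for this text, no byte-order mark

print(len(text), len(utf8), len(utf16))

# A byte string can hold any byte value, including zero, because its
# length is stored separately and no value is treated as a terminator.
blob = bytes([0x00, 0xFF, 0x41, 0x00])
print(len(blob), blob.count(0))
```

The number of characters and the number of bytes differ as soon as the text leaves the ASCII range, which is why the choice of encoding shapes the string representation.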
NULL-TERMINATED
• The length of a string can be stored implicitly by using a special terminating
character; often this is the null character (NUL), which has all bits zero, a
convention used and perpetuated by the popular C programming language.
Hence, this representation is commonly referred to as a C string. This
representation of an n-character string takes n + 1 units of space (1 for the
terminator), and is thus an implicit data structure.
• In terminated strings, the terminating code is not an allowable character in
any string. Strings with a length field do not have this limitation and can also
store arbitrary binary data.
• An example of a null-terminated string stored in a 10-byte buffer, along with
its ASCII (or more modern UTF-8) representation as 8-bit
hexadecimal numbers is:
F    R    A    N    K    NUL  k    e    f    w
46₁₆ 52₁₆ 41₁₆ 4E₁₆ 4B₁₆ 00₁₆ 6B₁₆ 65₁₆ 66₁₆ 77₁₆
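The 10-byte buffer from this example can be examined with a small Python sketch. The helper `c_string` is hypothetical; it only mimics how code that follows the C convention would read the buffer, by stopping at the first zero byte.

```python
# The 10 bytes of the example buffer, written as hexadecimal pairs.
buffer = bytes.fromhex("46 52 41 4E 4B 00 6B 65 66 77")


def c_string(buf: bytes) -> bytes:
    """Return the bytes before the first NUL (the whole buffer if there is none)."""
    end = buf.find(b"\x00")
    return buf if end == -1 else buf[:end]


name = c_string(buffer).decode("ascii")
storage = len(name) + 1           # n characters plus 1 terminator
leftover = buffer[storage:]       # bytes after the terminator are not part of the string

# A zero byte inside the data cuts a terminated string short.
truncated = c_string(b"AB\x00CD")

print(name, storage, leftover, truncated)
```

The last line shows why terminated strings cannot carry arbitrary binary data: everything after the first zero byte is ignored.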