Programacion Web Parte-4
Character Encodings
Appendix D, Color Names and Values, discusses how computers store information and how
a character-encoding scheme is a table that translates between characters and the way they are
stored in the computer.
Probably the most widely used character set (or character encoding) for encoding text
electronically is the American Standard Code for Information Interchange (ASCII). You can
expect all computers browsing the web to understand ASCII.
The problem with ASCII is that it supports only the uppercase and lowercase Latin alphabet,
the digits 0-9, and some extra characters: a total of 128 characters. Table E-1 lists
the printable characters of ASCII. (The other characters are control characters such as line
feeds and carriage returns.)
Table E-1: Printable Characters of ASCII
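(space) ! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~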
However, many languages use either accented Latin characters or completely different alphabets. ASCII does not address these characters, so you need to learn about character encodings
if you want to use any non-ASCII characters.
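The limit described above can be checked with a short Python sketch. The helper name ascii_encodable is hypothetical, chosen only for this illustration; it relies on the standard str.encode method raising UnicodeEncodeError when a character falls outside the 128 ASCII characters.

```python
def ascii_encodable(text: str) -> bool:
    """Return True if every character in text is one of the 128 ASCII characters."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        # At least one character has no ASCII representation.
        return False


print(ascii_encodable("cafe"))        # unaccented Latin letters only
print(ascii_encodable("caf\u00e9"))   # contains an accented e (U+00E9)
print(ascii_encodable("\u03a9"))      # Greek capital omega, a different alphabet
```

Only the first string fits in ASCII; the accented letter and the Greek letter both need a larger character set.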
Character encodings are also important if you want to use symbols (ranging from some dashes to
some quotation mark characters) because these cannot be guaranteed to transfer properly between
different encodings. If you do not indicate the character encoding the document is written in,
some of the special characters might not display correctly.
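As a rough illustration of what goes wrong, the following Python sketch writes text in one encoding and reads the bytes back in another. The function name misread and the sample string are assumptions made for this example; the encoding names are the ones Python's standard codecs accept.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one encoding, then decode the same bytes with another."""
    return text.encode(written_as).decode(read_as, errors="replace")


# Curly quotation marks (U+201C and U+201D) around a plain ASCII word.
original = "\u201cHello\u201d"

garbled = misread(original, "utf-8", "iso-8859-1")
intact = misread(original, "utf-8", "utf-8")

print(repr(original))
print(repr(garbled))   # each quotation mark has turned into three unrelated characters
print(intact == original)
```

The plain letters survive the mismatch because they are ASCII; only the quotation marks are damaged, which is why such problems often go unnoticed until a symbol appears.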
The International Organization for Standardization (ISO) created a range of character sets to
deal with different national characters, as shown in Table E-2. ISO-8859-1 is commonly used in
Western versions of authoring tools such as Adobe Dreamweaver, as well as applications such as
Windows Notepad.
Table E-2: ISO Character Sets
Character Set    Description
ISO-8859-1
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
ISO-8859-10
ISO-8859-15
ISO-8859-16      Latin 10. Covers SE Europe (Albanian, Croatian, Hungarian, Polish,
                 Romanian, and Slovenian); can also be used for French, German,
                 Italian, and Irish Gaelic.
ISO-2022-JP
ISO-2022-JP-2
ISO-2022-KR
It is helpful to note that the first 128 characters of ISO-8859-1 match those of ASCII, so you can
safely use those characters as you would in ASCII.
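This overlap can be confirmed with a small Python check. It is only a sketch: it decodes the 128 byte values 0 through 127 with both encodings, using Python's standard codec names "ascii" and "iso-8859-1", and compares the results.

```python
# All 128 byte values that ASCII defines.
ascii_bytes = bytes(range(128))

as_ascii = ascii_bytes.decode("ascii")
as_latin1 = ascii_bytes.decode("iso-8859-1")

# The two decodings agree character for character.
same_first_128 = as_ascii == as_latin1
print(same_first_128)

# Byte 0xE9 lies outside ASCII but is a valid ISO-8859-1 character (e with acute accent).
print(repr(bytes([0xE9]).decode("iso-8859-1")))
```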
The Unicode Consortium was then set up to devise a single way to represent the characters of all
languages, rather than relying on different, incompatible character codes for different languages.
Therefore, if you want to create documents that use characters from multiple character sets, you can
do so using a single Unicode character encoding. Furthermore, users can view documents written
in different character sets, provided their processor (and fonts) support the Unicode standard, no
matter what platform they are on or which country they are in. Having a single character encoding
can also reduce software development costs because programs do not need to be designed to
support multiple character encodings.
One problem with Unicode is that a lot of older programs were written to support only 8-bit character
sets (limiting them to 256 characters), which is nowhere near the number required for all languages.
Unicode therefore specifies encodings that can deal with a string in special ways to make enough
space for the huge character set it encompasses. These are known as UTF-8, UTF-16, and UTF-32,
as shown in Table E-3.
Table E-3: Unicode Character Sets
Character Set    Description
UTF-8            A Unicode Transformation Format that comes in 8-bit units (bytes).
                 A character in UTF-8 can be from 1 to 4 bytes long, making UTF-8
                 a variable-width encoding.
UTF-16           A Unicode Transformation Format that comes in 16-bit units (shorts).
                 A character can be 1 or 2 shorts long, making UTF-16 a
                 variable-width encoding.
UTF-32           A Unicode Transformation Format that comes in 32-bit units (longs).
                 It is a fixed-width encoding: every character is exactly 1 long.
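The widths listed in Table E-3 can be observed directly. The sketch below is an illustration only: the helper encoded_sizes and the four sample characters are assumptions chosen for this example. It uses the "-le" variants of Python's UTF-16 and UTF-32 codecs so that no byte-order mark is added to the counts.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes one character occupies in each Unicode encoding."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),   # little-endian, no byte-order mark
        "utf-32": len(ch.encode("utf-32-le")),   # little-endian, no byte-order mark
    }


samples = [
    "A",            # U+0041, an ASCII letter
    "\u00e9",       # U+00E9, e with acute accent
    "\u20ac",       # U+20AC, euro sign
    "\U0001F600",   # U+1F600, a character outside the first 65,536 code points
]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-8 grows from 1 to 4 bytes across these samples, UTF-16 uses one 16-bit unit (2 bytes) for the first three and two units (4 bytes) for the last, and UTF-32 always uses 4 bytes.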
The first 256 characters (code points) of Unicode correspond to the 256 characters of ISO-8859-1.
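A minimal Python sketch of this correspondence follows. It assumes Python's "iso-8859-1" codec, which maps every byte value from 0 to 255 to a character, and compares each decoded character's Unicode code point with the byte it came from.

```python
# Decode every possible ISO-8859-1 byte value, 0 through 255.
latin1_text = bytes(range(256)).decode("iso-8859-1")

# ord() gives the Unicode code point of each decoded character.
code_points = [ord(ch) for ch in latin1_text]

# Each byte value n decodes to the Unicode character with code point n.
first_256_match = code_points == list(range(256))
print(first_256_match)

# For example, e with acute accent is byte 0xE9 in ISO-8859-1 and U+00E9 in Unicode.
print(hex(ord("\u00e9")), "\u00e9".encode("iso-8859-1"))
```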
By default, HTML 4 processors should support UTF-8, and XML processors are supposed to support
UTF-8 and UTF-16; therefore, all XHTML-compliant processors should also support UTF-16 (because
XHTML is an application of XML). The HTML5 specification is strongly biased toward UTF-8.
In practice you almost always want to use UTF-8.
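One way to follow this advice when generating pages from a script is to name UTF-8 explicitly both in the markup and when writing the file. The sketch below is only an illustration: the file name page.html, the sample markup, and the helper save_and_reload are assumptions, and the file is written to a temporary folder so nothing is left behind.

```python
import os
import tempfile

# A small HTML5 page that declares its encoding and contains non-ASCII characters.
html = (
    "<!DOCTYPE html>\n"
    "<html>\n<head>\n"
    '<meta charset="utf-8">\n'
    "<title>Caf\u00e9</title>\n"
    "</head>\n"
    "<body><p>Price: 5 \u20ac</p></body>\n"
    "</html>\n"
)


def save_and_reload(markup: str) -> str:
    """Write markup to a temporary file as UTF-8 and read it back as UTF-8."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "page.html")  # hypothetical file name
        # Passing encoding explicitly avoids depending on the platform default.
        with open(path, "w", encoding="utf-8", newline="") as handle:
            handle.write(markup)
        with open(path, "r", encoding="utf-8", newline="") as handle:
            return handle.read()


reloaded = save_and_reload(html)
print(reloaded == html)
```

Because the declared encoding and the encoding actually used to save the file agree, the accented letter and the euro sign come back unchanged.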
For more information on internationalization and different character sets and encodings, see
www.i18nguy.com and the article "The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know about Unicode and Character Sets (No Excuses!)" at
www.joelonsoftware.com/articles/Unicode.html.