What's The Difference Between ASCII and Unicode - Stack Overflow


What's the difference between ASCII and Unicode?

Can I know the exact difference between Unicode and ASCII?

ASCII has a total of 128 characters (256 in the extended set).

Is there any size specification for Unicode characters?

unicode ascii

asked Oct 6 '13 at 18:25 by Ashvitha, edited Nov 10 '15 at 21:30 by nbro

tugay.biz/2016/07/what-is-ascii-and-unicode-and-character.html – Koray Tugay Jul 10 '16 at 18:08

7 Answers

ASCII defines 128 characters, which map to the numbers 0–127. Unicode defines (less than) 2^21 characters, which, similarly, map to the numbers 0–2^21 (though not all numbers are currently assigned, and some are reserved).

Unicode is a superset of ASCII, and the numbers 0–127 have the same meaning in ASCII as they have in Unicode. For example, the number 65 means "Latin capital 'A'".

Because Unicode characters don't generally fit into one 8-bit byte, there are numerous ways of storing Unicode characters in byte sequences, such as UTF-32 and UTF-8.
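A rough way to check this, assuming a Python 3 interpreter is at hand (a sketch, not a definitive reference):

# Code point 65 is "A" in ASCII and in Unicode alike; encodings just
# differ in how they write that number down as bytes.
print(ord("A"))                 # 65
print("A".encode("ascii"))      # b'A'              (one byte)
print("A".encode("utf-8"))      # b'A'              (same single byte)
print("A".encode("utf-32-be"))  # b'\x00\x00\x00A'  (four bytes)

# A non-ASCII code point such as U+20AC has no ASCII form at all.
print("€".encode("utf-8"))      # b'\xe2\x82\xac'   (three bytes)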

answered Oct 6 '13 at 18:29 by Kerrek SB, edited Dec 21 '17 at 12:49

128 codes in ASCII. 127 is DEL. – Stephen C Nov 5 '14 at 3:26

"Unicode is a superset of ASCII", good point – refactor Oct 25 '15 at 17:12

@riderBill: What now? Which 3 bits are you talking about? There are no bits in Unicode. Just code points. – Kerrek SB Feb 22 '16 at 22:14

@riderBill: Unicode does not "use between 1 and 4 bytes". Unicode is an assignment of meaning to numbers. It doesn't use any bytes. There are certain standardized encoding schemes to represent Unicode code points as a stream of bytes, but they are orthogonal to Unicode as a character set. (Yes, feel free to delete as you please.) – Kerrek SB Mar 7 '16 at 13:18

To clarify, the Unicode character set itself is a superset of the ISO 8859-1 character set, but UTF-8 encoding is not a superset of ISO 8859-1 encoding, only of ASCII encoding. – minmaxavg Jun 27 '16 at 5:29

Understanding why ASCII and Unicode were created in the first place helped me understand how they actually work.

ASCII, Origins

As stated in the other answers, ASCII uses 7 bits to represent a character. By using 7 bits, we can have a maximum of 2^7 (= 128) distinct combinations*, which means that we can represent 128 characters maximum.

Wait, 7 bits? But why not 1 byte (8 bits)?

The last (8th) bit was used for avoiding errors, as a parity bit. This was relevant years ago.

Most ASCII characters are printable characters of the alphabet such as abc, ABC, 123, ?&!, etc. The others are control characters such as carriage return, line feed, tab, etc.

See below the binary representation of a few characters in ASCII:

0100101 -> % (Percent Sign - 37)
1000001 -> A (Capital letter A - 65)
1000010 -> B (Capital letter B - 66)
1000011 -> C (Capital letter C - 67)
0001101 -> Carriage Return (13)

See the full ASCII table over here.
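If you want to reproduce those bit patterns yourself, here is a small sketch, assuming Python 3:

# Print the 7-bit pattern of a few ASCII characters, as in the list above.
for ch in "%ABC\r":
    print(format(ord(ch), "07b"), "->", repr(ch))
# 0100101 -> '%'
# 1000001 -> 'A'
# 1000010 -> 'B'
# 1000011 -> 'C'
# 0001101 -> '\r'  (carriage return)

print(2 ** 7)  # 128 possible 7-bit patterns in total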

ASCII was meant for English only.

What? Why English only? So many languages out there!

Because the center of the computer industry was in the USA at that time. As a consequence, they didn't need to support accents or other marks such as á, ü, ç, ñ, etc. (aka diacritics).

ASCII Extended

Some clever people started using the 8th bit (the bit used for parity) to encode more characters to support their language (to support "é", in French, for example). Just using one extra bit doubled the size of the original ASCII table to map up to 256 characters (2^8 = 256 characters), and not 2^7 as before (128).

10000010 -> é (e with acute accent - 130)
10100000 -> á (a with acute accent - 160)
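Those two byte values match one particular 8-bit table, IBM code page 437 (an assumption of this sketch; other extended tables place é and á at different byte values). In Python 3 terms:

# 0x82 (10000010) and 0xA0 (10100000) decode to é and á under CP437...
print(bytes([0b10000010]).decode("cp437"))  # é
print(bytes([0b10100000]).decode("cp437"))  # á

# ...while a different "extended ASCII" table uses different bytes entirely.
print("é".encode("latin-1"))                # b'\xe9'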

The name for this "ASCII extended to 8 bits and not 7 bits as before" could just be referred to as "extended ASCII" or "8-bit ASCII".

As @Tom pointed out in his comment below, there is no such thing as "extended ASCII", yet this is an easy way to refer to this 8th-bit trick. There are many variations of the 8-bit ASCII table, for example ISO 8859-1, also called ISO Latin-1.

Unicode, The Rise

ASCII Extended solves the problem for languages that are based on the Latin alphabet... what about the others needing a completely different alphabet? Greek? Russian? Chinese and the likes?

We would have needed an entirely new character set... that's the rationale behind Unicode. Unicode doesn't contain every character from every language, but it sure contains a gigantic amount of characters (see this table).

You cannot save text to your hard drive as "Unicode". Unicode is an abstract representation of the text. You need to "encode" this abstract representation. That's where an encoding comes into play.

Encodings: UTF-8 vs UTF-16 vs UTF-32

This answer does a pretty good job at explaining the basics:

UTF-8 and UTF-16 are variable length encodings.
In UTF-8, a character may occupy a minimum of 8 bits.
In UTF-16, a character length starts with 16 bits.
UTF-32 is a fixed length encoding of 32 bits.

UTF-8 uses the ASCII set for the first 128 characters. That's handy because it means ASCII text is also valid in UTF-8.

Mnemonics:

UTF-8: minimum 8 bits.
UTF-16: minimum 16 bits.
UTF-32: minimum and maximum 32 bits.
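A short sketch of these mnemonics, assuming Python 3 (the characters A, é, €, 𝄞 are arbitrary examples):

# Bytes needed per character in each encoding.
for ch in ("A", "é", "€", "𝄞"):         # U+0041, U+00E9, U+20AC, U+1D11E
    print(ch,
          len(ch.encode("utf-8")),      # 1, 2, 3, 4 bytes
          len(ch.encode("utf-16-le")),  # 2, 2, 2, 4 bytes
          len(ch.encode("utf-32-le")))  # 4, 4, 4, 4 bytes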

Note:

Why 2^7?

This is obvious for some, but just in case. We have seven slots available, filled with either 0 or 1 (binary code). Each can have two combinations. If we have seven spots, we have 2 * 2 * 2 * 2 * 2 * 2 * 2 = 2^7 = 128 combinations. Think about this as a combination lock with seven wheels, each wheel having two numbers only.
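The combination-lock arithmetic, spelled out as a tiny Python 3 sketch:

from itertools import product

# Seven "wheels", each with two positions (0 or 1).
print(len(list(product("01", repeat=7))))  # 128
print(2 ** 7)                              # 128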

Source: Wikipedia and this great blog post.

answered Dec 17 '16 at 12:18 by andrew, edited Nov 19 '17 at 21:59

There is no text but encoded text. Some encodings are very straightforward, particularly for character sets with <= 256 code points. "Extended ASCII" is a very ambiguous term; there are some that do support Greek, Russian and/or Polish. ASCII is insufficient for English text, which does use á, ü, ç, ñ. I suspect that it was designed to support computer languages rather than human languages. Dogmatically, when you write a file or stream, you have a character set and choose an encoding. Your reader has to get the bytes and knowledge of which encoding. Otherwise, the communication has failed. – Tom Blodget Dec 18 '16 at 0:39

Thank you very much for the addendum. I updated the answer accordingly. – andrew Dec 30 '16 at 17:12

This should be the accepted answer – Mubasher Apr 6 '17 at 7:46

Thank you. I notice everywhere ASCII tables show character codes as 0-127 but UTF-8 tables show the codes as hex and not integers. Is there a reason for this? Why don't UTF-X tables show 0-127/255/65535 versus 00-AF? Does this mean anything? – wayofthefuture Jun 16 '17 at 16:31

Thank you for your answer. Quick question: "In UTF-16, a character length starts with 16 bits". Does this mean that alphanumeric characters can't be represented by UTF-16 since they are only 8-bit characters? – Moondra Aug 14 '17 at 14:20

ASCII has 128 code points, 0 through 127. It can fit in a single 8-bit byte, and the values 128 through 255 tended to be used for other characters, with incompatible choices causing the code page disaster. Text encoded in one code page cannot be read correctly by a program that assumes or guesses another code page.
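A small sketch of that disaster, assuming Python 3 and using Windows-1252 and CP437 as the two example code pages:

# The same bytes read back under two different code page guesses.
data = "déjà vu".encode("cp1252")  # written by a program assuming Windows-1252
print(data.decode("cp1252"))       # déjà vu  (right guess)
print(data.decode("cp437"))        # dΘjα vu  (wrong guess: mojibake)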

Unicode came about to solve this disaster. Version 1 started out with 65536 code points, commonly encoded in 16 bits. It was later extended in version 2 to 1.1 million code points. The current version is 6.3, using 110,187 of the available 1.1 million code points. That doesn't fit in 16 bits anymore.

Encoding in 16 bits was common when v2 came around, used by the Microsoft and Apple operating systems, for example, and by language runtimes like Java. The v2 spec came up with a way to map those 1.1 million code points into 16-bit units: an encoding called UTF-16, a variable length encoding where one code point can take either 2 or 4 bytes. The original v1 code points take 2 bytes, added ones take 4.
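For illustration, a Python 3 sketch (U+1F600 is picked here as an arbitrary example of an added code point):

# Original 16-bit code points take 2 bytes in UTF-16; later ones take 4.
print(len("A".encode("utf-16-le")))   # 2 bytes  (U+0041)
print(len("€".encode("utf-16-le")))   # 2 bytes  (U+20AC)
print(len("😀".encode("utf-16-le")))  # 4 bytes  (U+1F600, a surrogate pair)
print("😀".encode("utf-16-le"))       # b'=\xd8\x00\xde', i.e. 0xD83D 0xDE00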

Another variable length encoding that's very common, used in *nix operating systems and tools, is UTF-8. A code point can take between 1 and 4 bytes; the original ASCII codes take 1 byte, the rest take more. The only non-variable length encoding is UTF-32, which takes 4 bytes for a code point. It is not often used since it is pretty wasteful. There are other ones, like UTF-1 and UTF-7, that are widely ignored.
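The raw bytes make the difference plain; a sketch assuming Python 3:

# UTF-8 spends 1 byte on the old ASCII range, up to 4 elsewhere;
# UTF-32 always spends 4 bytes per code point.
for ch in ("A", "é", "€", "😀"):
    print(ch, ch.encode("utf-8"), ch.encode("utf-32-le"))
# A  b'A'                  b'A\x00\x00\x00'
# é  b'\xc3\xa9'           b'\xe9\x00\x00\x00'
# €  b'\xe2\x82\xac'       b'\xac \x00\x00'
# 😀 b'\xf0\x9f\x98\x80'   b'\x00\xf6\x01\x00'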

An issue with the UTF-16/32 encodings is that the order of the bytes will depend on the endianness of the machine that created the text stream. So add to the mix UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE.
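A quick sketch of the byte-order variants, assuming Python 3 and using U+20AC as the example:

# Same code point, two byte orders.
print("€".encode("utf-16-be"))  # b' \xac'         (0x20 0xAC)
print("€".encode("utf-16-le"))  # b'\xac '         (0xAC 0x20)
print("€".encode("utf-32-be"))  # b'\x00\x00 \xac'
print("€".encode("utf-32-le"))  # b'\xac \x00\x00'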

Having these different encoding choices brings back the code page disaster to some degree, along with heated debates among programmers about which UTF choice is "best". Their association with operating system defaults pretty much draws the lines. One counter-measure is the definition of a BOM, the Byte Order Mark, a special code point (U+FEFF, zero-width no-break space) at the beginning of a text stream that indicates how the rest of the stream is encoded. It indicates both the UTF encoding and the endianness and is neutral to a text rendering engine. Unfortunately it is optional and many programmers claim their right to omit it, so accidents are still pretty common.
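A sketch of what the BOM looks like in practice, assuming Python 3, whose "utf-16" and "utf-8-sig" codecs prepend it automatically:

import codecs

print(codecs.BOM_UTF16_LE)             # b'\xff\xfe'
print(codecs.BOM_UTF16_BE)             # b'\xfe\xff'
print("Hi".encode("utf-16"))           # BOM in native order, then the two characters
print("Hi".encode("utf-8-sig"))        # b'\xef\xbb\xbfHi'
print("\ufeffHi".encode("utf-16-le"))  # writing the BOM by hand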

answered Oct 6 '13 at 19:12 by Hans Passant, edited Sep 30 '15 at 15:18

Java provides support for Unicode, i.e. it supports all worldwide alphabets. Hence the size of char in Java is 2 bytes, and its range is 0 to 65535.
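For context, a sketch in Python 3 (used here only for illustration): a single 16-bit unit covers 0 to 65535, so code points above U+FFFF need two UTF-16 units, which is why a single 16-bit char cannot hold them on its own.

print(ord("\uffff"))                       # 65535, the largest 16-bit value
print(len("😀".encode("utf-16-le")) // 2)  # 2 sixteen-bit units for U+1F600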

answered Nov 4 '17 at 6:32 by Siddarth Kanted, edited Jan 7 at 6:28

My vote is for the image. It explains everything. – Ankush Jain Oct 14 at 17:16

ASCII has 128 code positions, allocated to graphic characters and control characters (control codes).

Unicode has 1,114,112 code positions. About 100,000 of them have currently been allocated to characters, many code points have been made permanently noncharacters (i.e. never used to encode any character), and most code points are not yet assigned.
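Where 1,114,112 comes from, as a quick check (a Python 3 sketch): 17 planes of 65,536 code points each, i.e. U+0000 through U+10FFFF.

print(17 * 65536)    # 1114112
print(0x10FFFF + 1)  # 1114112, since the last code point is U+10FFFF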

The only things that ASCII and Unicode have in common are: 1) They are character codes. 2) The first 128 code positions of Unicode have been defined to have the same meanings as in ASCII, except that the code positions of ASCII control characters are just defined as denoting control characters, with names corresponding to their ASCII names, but their meanings are not defined in Unicode.

Sometimes, however, Unicode is characterized (even in the Unicode standard!) as "wide ASCII". This is a slogan that mainly tries to convey the idea that Unicode is meant to be a universal character code the same way as ASCII once was (though the character repertoire of ASCII was hopelessly insufficient for universal use), as opposed to using different codes in different systems and applications and for different languages.

Unicode as such defines only the "logical size" of characters: each character has a code number in a specific range. These code numbers can be presented using different transfer encodings, and internally, in memory, Unicode characters are usually represented using one or two 16-bit quantities per character, depending on character range, sometimes using one 32-bit quantity per character.

answered Oct 6 '13 at 18:51 by Jukka K. Korpela, edited Apr 4 '17 at 18:53 by Peter Mortensen

I think the most common encoding for Unicode is UTF-8 these days. UTF-8 encodes most of the code points in 1, 2 or 3 bytes. – Binarus Apr 23 '16 at 14:42

ASCII and Unicode are two character encodings. Basically, they are standards on how to represent different characters in binary so that they can be written, stored, transmitted, and read in digital media. The main difference between the two is in the way they encode the character and the number of bits that they use for each. ASCII originally used seven bits to encode each character. This was later increased to eight with Extended ASCII to address the apparent inadequacy of the original. In contrast, Unicode uses a variable bit encoding scheme where you can choose between 32, 16, and 8-bit encodings. Using more bits lets you use more characters at the expense of larger files, while fewer bits give you a limited choice but you save a lot of space. Using fewer bits (i.e. UTF-8 or ASCII) would probably be best if you are encoding a large document in English.
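A sketch of that size trade-off for plain English text, assuming Python 3 (the sample sentence is arbitrary):

text = "The quick brown fox jumps over the lazy dog. " * 1000
print(len(text.encode("utf-8")))   # 45000 bytes  (1 byte per character here)
print(len(text.encode("utf-16")))  # 90002 bytes  (2 bytes each, plus a BOM)
print(len(text.encode("utf-32")))  # 180004 bytes (4 bytes each, plus a BOM)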

One of the main reasons Unicode was created was the problem that arose from the many non-standard extended ASCII programs. Unless you are using the prevalent code page, which is used by Microsoft and most other software companies, then you are likely to encounter problems with your characters appearing as boxes. Unicode virtually eliminates this problem as all the character code points were standardized.

Another major advantage of Unicode is that at its maximum it can accommodate a huge number of characters. Because of this, Unicode currently contains most written languages and still has room for even more. This includes typical left-to-right scripts like English and even right-to-left scripts like Arabic. Chinese, Japanese, and the many other variants are also represented within Unicode. So Unicode won't be replaced anytime soon.

In order to maintain compatibility with the older ASCII, which was already in widespread use at the time, Unicode was designed in such a way that its first 128 code points match those of ASCII (and, beyond that, the most popular extended ASCII page, Latin-1). So if you open an ASCII encoded file with Unicode, you still get the correct characters encoded in the file. This facilitated the adoption of Unicode as it lessened the impact of adopting a new encoding standard for those who were already using ASCII.
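A minimal sketch of that compatibility, assuming Python 3:

raw = b"Hello, world"       # bytes written as plain ASCII
print(raw.decode("ascii"))  # Hello, world
print(raw.decode("utf-8"))  # Hello, world  (identical, by design)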

Summary:

1. ASCII uses an 8-bit encoding while Unicode uses a variable bit encoding.
2. Unicode is standardized while ASCII isn't.
3. Unicode represents most written languages in the world while ASCII does not.
4. ASCII has its equivalent within Unicode.

Taken from: http://www.differencebetween.net/technology/software-technology/difference-between-unicode-and-ascii/#ixzz4zEjnxPhs

answered Nov 23 '17 at 7:14 by Nikhil Katre

ASCII defines 128 characters, whereas Unicode contains a repertoire of more than 120,000 characters.

answered Aug 16 '15 at 3:33 by sphynx888
