The Unicode Standard, Version 12.0
To learn about the latest version of the Unicode Standard, see http://www.unicode.org/versions/latest/.
Many of the designations used by manufacturers and sellers to distinguish their products are claimed
as trademarks. Where those designations appear in this book, and the publisher was aware of a trade-
mark claim, the designations have been printed with initial capital letters or in all capitals.
Unicode and the Unicode Logo are registered trademarks of Unicode, Inc., in the United States and
other countries.
The authors and publisher have taken care in the preparation of this specification, but make no
expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No
liability is assumed for incidental or consequential damages in connection with or arising out of the
use of the information or programs contained herein.
The Unicode Character Database and other files are provided as-is by Unicode, Inc. No claims are
made as to fitness for any particular purpose. No warranties of any kind are expressed or implied.
The recipient agrees to determine applicability of information provided.
© 2019 Unicode, Inc.
All rights reserved. This publication is protected by copyright, and permission must be obtained from
the publisher prior to any prohibited reproduction. For information regarding permissions, inquire
at http://www.unicode.org/reporting.html. For information about the Unicode terms of use, please
see http://www.unicode.org/copyright.html.
The Unicode Standard / the Unicode Consortium; edited by the Unicode Consortium. — Version
12.0.
Includes index.
ISBN 978-1-936213-22-1 (http://www.unicode.org/versions/Unicode12.0.0/)
1. Unicode (Computer character set) I. Unicode Consortium.
QA268.U545 2019
ISBN 978-1-936213-22-1
Published in Mountain View, CA
March 2019
Chapter 12
South and Central Asia-I
Official Scripts of India
The scripts of South Asia share so many common features that a side-by-side comparison
of a few will often reveal structural similarities even in the modern letterforms. With minor
historical exceptions, they are written from left to right. They are all abugidas in which
most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-
initial vowels in many of these scripts have distinct symbols, and word-internal vowels are
usually written by juxtaposing a vowel sign in the vicinity of the affected consonant.
Absence of the inherent vowel, when that occurs, is frequently marked with a special sign.
In the Unicode Standard, this sign is denoted by the Sanskrit word virāma. In some languages,
another designation is preferred. In Hindi, for example, the word hal refers to the
character itself, and halant refers to the consonant that has its inherent vowel suppressed;
in Tamil, the word puḷḷi is used. The virama sign nominally serves to suppress the inherent
vowel of the consonant to which it is applied; it is a combining character, with its shape
varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south,
from Pakistan in the west to the easternmost islands of Indonesia, are derived from the
ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from
the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both
ultimately of Semitic origin, probably deriving from Aramaic, which was an important
administrative language of the Middle East at that time. Kharoshthi, written from right to
left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with
myriad changes throughout the subcontinent and outlying islands. There are said to be
some 200 different scripts deriving from it. By the eleventh century, the modern script
known as Devanagari was in ascendancy in India proper as the major script of Sanskrit lit-
erature.
The North Indian branch of scripts was, like Brahmi itself, chiefly used to write Indo-Euro-
pean languages such as Pali and Sanskrit, and eventually the Hindi, Bengali, and Gujarati
languages, though it was also the source for scripts for non-Indo-European languages such
as Tibetan, Mongolian, and Lepcha.
The South Indian scripts are also derived from Brahmi and, therefore, share many struc-
tural characteristics. These scripts were first used to write Pali and Sanskrit but were later
adapted for use in writing non-Indo-European languages—namely, the languages of the
Dravidian family of southern India and Sri Lanka. Because of their use for Dravidian lan-
guages, the South Indian scripts developed many characteristics that distinguish them
from the North Indian scripts. South Indian scripts were also exported to southeast Asia
and were the source of scripts such as Tai Tham (Lanna) and Myanmar, as well as the insu-
lar scripts of the Philippines and Indonesia.
The shapes of letters in the South Indian scripts took on a quite distinct look from the shapes
of letters in the North Indian scripts. Some scholars suggest that this occurred because writ-
ing materials such as palm leaves encouraged changes in the way letters were written.
The major official scripts of India proper, including Devanagari, are documented in this
chapter. They are all encoded according to a common plan, so that comparable characters
are in the same order and relative location. This structural arrangement, which facilitates
transliteration to some degree, is based on the Indian national standard (ISCII) encoding
for these scripts.
The first six columns in each script are isomorphic with the ISCII-1988 encoding, except
that the last 11 positions (U+0955..U+095F in Devanagari, for example), which are unassigned
or undefined in ISCII-1988, are used in the Unicode encoding. The seventh column
in each of these scripts, along with the last 11 positions in the sixth column, represents additional
character assignments in the Unicode Standard that are matched across some or all
of the scripts. For example, positions U+xx66..U+xx6F and U+xxE6..U+xxEF code the
Indic script digits for each script. The eighth column for each script is reserved for script-specific
additions that do not correspond from one Indic script to the next.
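Because comparable characters sit at the same offset in each block, a character can be moved between these scripts by arithmetic on its code point. The following is a minimal sketch, not a transliteration algorithm: the helper names `shift_script` and `indic_digit_value` and the `BLOCK_START` table are illustrative assumptions, and only three blocks are listed.

```python
import unicodedata

# First code point of three blocks that follow the common layout.
BLOCK_START = {"devanagari": 0x0900, "bengali": 0x0980, "gujarati": 0x0A80}

def shift_script(text, src, dst):
    """Move each character of the src block to the same offset in the dst block.

    Characters outside the src block, or whose target position is
    unassigned, are left unchanged.
    """
    lo, target = BLOCK_START[src], BLOCK_START[dst]
    out = []
    for ch in text:
        offset = ord(ch) - lo
        if 0 <= offset < 0x80:
            candidate = chr(target + offset)
            # unicodedata.name() returns the default for unassigned code points.
            if unicodedata.name(candidate, ""):
                ch = candidate
        out.append(ch)
    return "".join(out)

def indic_digit_value(ch):
    """Return 0-9 for a digit at U+xx66..U+xx6F or U+xxE6..U+xxEF, else None."""
    cp = ord(ch)
    low = cp & 0x7F
    if (cp - low) in BLOCK_START.values() and 0x66 <= low <= 0x6F:
        return low - 0x66
    return None

print(shift_script("\u0915\u093F", "devanagari", "bengali"))  # KA + vowel sign I
print(indic_digit_value("\u0969"))                            # DEVANAGARI DIGIT THREE
```

A shifted string is only as meaningful as the correspondence between the two scripts at that position, which is why the layout is said to facilitate transliteration only to some degree.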
While the arrangement of the encoding for the scripts of India is based on ISCII, this does
not imply that the rendering behavior of South Indian scripts in particular is the same as
that of Devanagari or other North Indian scripts. Implementations should ensure that ade-
quate attention is given to the actual behavior of those scripts; they should not assume that
they work just as Devanagari does. Each block description in this chapter describes the
most important aspects of rendering for a particular script as well as unique behaviors it
may have.
Many of the character names in this group of scripts represent the same sounds, and com-
mon naming conventions are used for the scripts of India.
12.1 Devanagari
Devanagari: U+0900–U+097F
The Devanagari script is used for writing classical Sanskrit and its modern historical deriv-
ative, Hindi. Extensions to the Sanskrit repertoire are used to write other related languages
of India (such as Marathi) and of Nepal (Nepali). In addition, the Devanagari script is used
to write the following languages: Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha,
Chhattisgarhi, Garhwali, Gondi (Betul, Chhindwara, and Mandla dialects), Harauti, Ho,
Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari,
Newari, Palpa, and Santali.
All other Indic scripts, as well as the Sinhala script of Sri Lanka, the Tibetan script, and the
Southeast Asian scripts, are historically connected with the Devanagari script as descen-
dants of the ancient Brahmi script. The entire family of scripts shares a large number of
structural features.
The principles of the Indic scripts are covered in some detail in this introduction to the
Devanagari script. The remaining introductions to the Indic scripts are abbreviated but
highlight any differences from Devanagari where appropriate.
Standards. The Devanagari block of the Unicode Standard is based on ISCII-1988 (Indian
Script Code for Information Interchange). The ISCII standard of 1988 differs from and is
an update of earlier ISCII standards issued in 1983 and 1986.
The Unicode Standard encodes Devanagari characters in the same relative positions as
those coded in positions A0–F4 (hexadecimal) in the ISCII-1988 standard. The same character code layout
is followed for eight other Indic scripts in the Unicode Standard: Bengali, Gurmukhi,
Gujarati, Oriya, Tamil, Telugu, Kannada, and Malayalam. This parallel code layout
emphasizes the structural similarities of the Brahmi scripts and follows the stated intention
of the Indian coding standards to enable one-to-one mappings between analogous coding
positions in different scripts in the family. Sinhala, Tibetan, Thai, Lao, Khmer, Myanmar,
and other scripts depart to a greater extent from the Devanagari structural pattern, so the
Unicode Standard does not attempt to provide any direct mappings for these scripts to the
Devanagari order.
In November 1991, at the time The Unicode Standard, Version 1.0, was published, the
Bureau of Indian Standards published a new version of ISCII in Indian Standard (IS)
13194:1991. This new version partially modified the layout and repertoire of the ISCII-
1988 standard. Because of these events, the Unicode Standard does not precisely follow the
layout of the current version of ISCII. Nevertheless, the Unicode Standard remains a sup-
erset of the ISCII-1991 repertoire. Modern, non-Vedic texts encoded with ISCII-1991 may
be automatically converted to Unicode code points and back to their original encoding
without loss of information. The Vedic extension characters defined in IS 13194:1991
Annex G—Extended Character Set for Vedic are now fully covered by the Unicode Standard,
but the conversions between ISCII and Unicode code points in some cases are more com-
plex than for modern texts.
Encoding Principles. The writing systems that employ Devanagari and other Indic scripts
constitute abugidas—a cross between syllabic writing systems and alphabetic writing sys-
tems. The effective unit of these writing systems is the orthographic syllable, consisting of a
consonant and vowel (CV) core and, optionally, one or more preceding consonants, with a
canonical structure of (((C)C)C)V. The orthographic syllable need not correspond exactly
with a phonological syllable, especially when a consonant cluster is involved, but the writ-
ing system is built on phonological principles and tends to correspond quite closely to pro-
nunciation.
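The syllable structure described above can be approximated over encoded text with a regular expression. This is a rough sketch under stated assumptions: the pattern name `SYLLABLE` is illustrative, only the core consonant, vowel, and vowel-sign ranges of the Devanagari block are covered, joiners and modifier marks are ignored, and the number of leading dead consonants is not capped at three.

```python
import re

C = "[\u0915-\u0939\u0958-\u095F]\u093C?"  # consonant letter, optional nukta
H = "\u094D"                               # virama
M = "[\u093E-\u094C]"                      # dependent vowel signs (core range)
V = "[\u0904-\u0914]"                      # independent vowel letters (core range)

# Zero or more dead consonants, a final consonant, then an optional vowel
# sign or a closing virama; or a single independent vowel letter.
SYLLABLE = re.compile(f"(?:{C}{H})*{C}(?:{M}|{H})?|{V}")

def orthographic_syllables(text):
    """Split text into approximate orthographic syllables."""
    return SYLLABLE.findall(text)

# HA + I, then NA + VIRAMA + DA + II
print(orthographic_syllables("\u0939\u093F\u0928\u094D\u0926\u0940"))
```

Characters outside the listed ranges are skipped by `findall`, so this is suited to illustration rather than to production text segmentation.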
The orthographic syllable is built up of alphabetic pieces, the actual letters of the Devana-
gari script. These pieces consist of three distinct character types: consonant letters, inde-
pendent vowels, and dependent vowel signs. In a text sequence, these characters are stored
in logical (phonetic) order. Consonant letters by themselves constitute a CV unit, where the
V is an inherent vowel, whose exact phonetic value may vary by writing system. Indepen-
dent vowels also constitute a CV unit, where the C is considered to be null.
A dependent vowel sign is used to represent a V in CV units where C is not null and V is not
the inherent vowel. CV units are not represented by sequences of a consonant followed by
virama followed by independent vowel. In some cases, a phonological diphthong (such as
Hindi जाओ /jāo/) is actually written as two orthographic CV units, where the second of
these units is an independent vowel letter, whose C is considered to be null.
Some Devanagari consonant letters have alternative presentation forms whose choice
depends on neighboring consonants. This variability is especially notable for U+0930
devanagari letter ra, which has numerous different forms, both as the initial element
and as the final element of a consonant cluster. Only the nominal forms, rather than the
contextual alternatives, are depicted in the code charts.
The traditional Sanskrit/Devanagari alphabetic encoding order for consonants follows
articulatory phonetic principles, starting with velar consonants and moving forward to
bilabial consonants, followed by liquids and then fricatives. ISCII and the Unicode Stan-
dard both observe this traditional order.
Independent Vowel Letters. The independent vowels in Devanagari are letters that stand
on their own. The writing system treats independent vowels as orthographic CV syllables in
which the consonant is null. The independent vowel letters are used to write syllables that
start with a vowel.
Dependent Vowel Signs (Matras). The dependent vowels serve as the common manner of
writing noninherent vowels and are generally referred to as vowel signs, or as matras in
Sanskrit. The dependent vowels do not stand alone; rather, they are visibly depicted in
combination with a base letterform. A single consonant or a consonant cluster may have a
dependent vowel applied to it to indicate the vowel quality of the syllable, when it is differ-
ent from the inherent vowel. Explicit appearance of a dependent vowel in a syllable over-
rides the inherent vowel of a single consonant letter.
The greatest variation among different Indic scripts is found in the way that the dependent
vowels are applied to base letterforms. Devanagari has a collection of nonspacing depen-
dent vowel signs that may appear above or below a consonant letter, as well as spacing
dependent vowel signs that may occur to the right or to the left of a consonant letter or
consonant cluster. Other Indic scripts generally have one or more of these forms, but what
is a nonspacing mark in one script may be a spacing mark in another. Also, some of the
Indic scripts have single dependent vowels that are indicated by two or more glyph compo-
nents—and those glyph components may surround a consonant letter both to the left and
to the right or may occur both above and below it.
In modern usage the Devanagari script has only one character denoting a left-side depen-
dent vowel sign: U+093F devanagari vowel sign i. In the historic Prishthamatra orthog-
raphy, Devanagari also made use of one additional left-side dependent vowel sign: U+094E
devanagari vowel sign prishthamatra e. Other Indic scripts either have no such vowel
signs (Telugu and Kannada) or include as many as three of these signs (Bengali, Tamil, and
Malayalam).
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 12-1 shows vowel letters that can be
analyzed, the single code point that should be used to represent them in text, and the
sequence of code points resulting from analysis that should not be used.
Virama (Halant). Devanagari employs a sign known in Sanskrit as the virama or vowel
omission sign. In Hindi, it is called hal or halant, and that term is used in referring to the
virama or to a consonant with its vowel suppressed by the virama. The terms are used
interchangeably in this section.
The virama sign, U+094D devanagari sign virama, nominally serves to cancel (or kill)
the inherent vowel of the consonant to which it is applied. When a consonant has lost its
inherent vowel by the application of virama, it is known as a dead consonant; in contrast, a
live consonant is one that retains its inherent vowel or is written with an explicit dependent
vowel sign. In the Unicode Standard, a dead consonant is defined as a sequence consisting
of a consonant letter followed by a virama. The default rendering for a dead consonant is to
position the virama as a combining mark bound to the consonant letterform.
For example, if Cn denotes the nominal form of consonant C, and Cd denotes the dead con-
sonant form, then a dead consonant is encoded as shown in Figure 12-1.
Figure 12-1. TAn + VIRAMAn → TAd (त + ◌् → त्)
It could be assumed that a dead consonant may be combined with a vowel letter or sign to
represent a CV orthographic syllable. Some non-Unicode implementations have used this
approach; however, this is not done in implementations of the Unicode Standard. Instead,
a CV orthographic syllable is represented with a (live) consonant followed by a dependent
vowel. A dead consonant should not be followed either by an independent vowel letter or
by a dependent vowel sign in an attempt to create an alternative representation of a CV
orthographic syllable.
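A text checker could flag the discouraged representation by looking for a virama that is directly followed by a vowel. The sketch below is illustrative only: the function names are assumptions, and the vowel test covers just the core independent vowel and vowel sign ranges of the Devanagari block.

```python
VIRAMA = "\u094D"

def is_vowel(ch):
    """True for core Devanagari independent vowels and dependent vowel signs."""
    cp = ord(ch)
    return 0x0904 <= cp <= 0x0914 or 0x093E <= cp <= 0x094C

def find_dead_consonant_plus_vowel(text):
    """Return the index of every virama that is directly followed by a vowel."""
    return [i for i in range(len(text) - 1)
            if text[i] == VIRAMA and is_vowel(text[i + 1])]

print(find_dead_consonant_plus_vowel("\u0915\u093F"))        # KA + vowel sign I
print(find_dead_consonant_plus_vowel("\u0915\u094D\u0907"))  # KA + VIRAMA + letter I
```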
Atomic Representation of Consonant Letters. Consonant letters are encoded atomically
in Unicode, even if they can be analyzed visually as consisting of multiple parts. In particu-
lar, consonant half forms are dead-consonant forms that often resemble a full consonant
form minus a vertical stem. This vertical stem is visually similar to the vowel sign denoting
/ā/, U+093E devanagari vowel sign aa. Table 12-2 shows atomic consonant letters in
Devanagari that could be graphically analyzed this way, the single code point that should
be used to represent them in text, and the sequence of code points resulting from analysis
that should not be used.
Letter  Use   Do Not Use
ऩ       0929  <0929, 094D, 093E>, <0929, 094D, 200D, 093E>,
              <0928, 093C, 094D, 093E>, <0928, 093C, 094D, 200D, 093E>
प       092A  <092A, 094D, 093E>, <092A, 094D, 200D, 093E>
ख़       0959  <0959, 094D, 093E>, <0959, 094D, 200D, 093E>,
              <0916, 093C, 094D, 093E>, <0916, 093C, 094D, 200D, 093E>
ग़       095A  <095A, 094D, 093E>, <095A, 094D, 200D, 093E>,
              <0917, 093C, 094D, 093E>, <0917, 093C, 094D, 200D, 093E>
ज़       095B  <095B, 094D, 093E>, <095B, 094D, 200D, 093E>,
              <091C, 093C, 094D, 093E>, <091C, 093C, 094D, 200D, 093E>
य़       095F  <095F, 094D, 093E>, <095F, 094D, 200D, 093E>,
              <092F, 093C, 094D, 093E>, <092F, 093C, 094D, 200D, 093E>
ॹ       0979  <0979, 094D, 093E>, <0979, 094D, 200D, 093E>
The principle of using atomic consonant representations, rather than representations ana-
lyzing the consonant into a half form plus stem, also applies to other Indic scripts, such as
Gujarati and Bengali.
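The rows listed for Table 12-2 follow a regular pattern: the letter, or its base-plus-nukta spelling, followed by virama, an optional zero width joiner, and the vowel sign AA. A small sketch can generate those sequences and search for them. The names `ROWS`, `DISCOURAGED`, and `find_discouraged` are hypothetical, and only the seven letters shown above are covered.

```python
VIRAMA, ZWJ, NUKTA, AA = "\u094D", "\u200D", "\u093C", "\u093E"

# (atomic letter, base letter of its nukta spelling or None), per the rows above.
ROWS = [
    ("\u0929", "\u0928"),
    ("\u092A", None),
    ("\u0959", "\u0916"),
    ("\u095A", "\u0917"),
    ("\u095B", "\u091C"),
    ("\u095F", "\u092F"),
    ("\u0979", None),
]

def discouraged_spellings(atomic, base):
    """Build the analyzed sequences that should not stand in for `atomic`."""
    stems = [atomic] + ([base + NUKTA] if base else [])
    return [stem + VIRAMA + joiner + AA for stem in stems for joiner in ("", ZWJ)]

DISCOURAGED = {seq: atomic
               for atomic, base in ROWS
               for seq in discouraged_spellings(atomic, base)}

def find_discouraged(text):
    """Return (index, sequence, recommended letter) for each listed sequence found."""
    hits = []
    for seq, atomic in DISCOURAGED.items():
        start = text.find(seq)
        while start != -1:
            hits.append((start, seq, atomic))
            start = text.find(seq, start + 1)
    return sorted(hits)

print(find_discouraged("\u092A\u094D\u093E"))  # PA + VIRAMA + vowel sign AA
```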
Consonant Conjuncts. The Indic scripts are noted for a large number of consonant con-
junct forms that serve as orthographic abbreviations (ligatures) of two or more adjacent
letterforms. This abbreviation takes place only in the context of a consonant cluster. An
orthographic consonant cluster is defined as a sequence of characters that represents one or
more dead consonants (denoted Cd) followed by a normal, live consonant letter (denoted
Cl).
Under normal circumstances, a consonant cluster is depicted with a conjunct glyph if such
a glyph is available in the current font. In the absence of a conjunct glyph, the one or more
dead consonants that form part of the cluster are depicted using half-form glyphs. In the
absence of half-form glyphs, the dead consonants are depicted using the nominal conso-
nant forms combined with visible virama signs (see Figure 12-2).
(1) GAd + DHAl → GAh + DHAn        ग् + ध → ग्ध
(2) KAd + KAl → K.KAn              क् + क → क्क
(3) KAd + SSAl → K.SSAn            क् + ष → क्ष
(4) RAd + KAl → KAl + RAsup        र् + क → र्क
प् + स् + य
A well-designed Indic script font may contain hundreds of conjunct glyphs, but they are
not encoded as Unicode characters because they are the result of ligation of distinct letters.
Indic script rendering software must be able to map appropriate combinations of charac-
ters in context to the appropriate conjunct glyphs in fonts.
A dead consonant conjunct may have an appearance like a half form, because the vertical
stem of the last consonant is removed. As a result, a live consonant conjunct could be ana-
lyzed visually as consisting of the dead, consonant-conjunct half form plus the vowel sign
/ā/. As in the case of consonant letters, the live form should not be represented using a half
form followed by U+093E devanagari vowel sign aa. Table 12-3 shows some examples
of live consonant conjuncts that exhibit this visual pattern, but that should not be repre-
sented with fully analyzed sequences. Table 12-3 also shows the sequence of code points
that should be used to represent these conjuncts in text, and the sequence of code points
resulting from analysis that should not be used.
Note that these are illustrative examples only. There are many consonant conjuncts that
could be visually analyzed in the same way, and the same principle applies to all such cases:
these should not be represented as dead conjunct plus vowel sign sequences. The principle
of using atomic consonant representations, rather than representations analyzing the con-
sonant into a half form plus stem, also applies to other Indic scripts, such as Gujarati and
Bengali.
Explicit Virama (Halant). Normally a virama character serves to create dead consonants
that are, in turn, combined with subsequent consonants to form conjuncts. This behavior
usually results in a virama sign not being depicted visually. Occasionally, this default
behavior is not desired when a dead consonant should be excluded from conjunct forma-
tion, in which case the virama sign is visibly rendered. To accomplish this goal, the Uni-
code Standard adopts the convention of placing the character U+200C zero width non-
joiner immediately after the encoded dead consonant that is to be excluded from conjunct
formation. In this case, the virama sign is always depicted as appropriate for the consonant
to which it is attached.
For example, in Figure 12-4, the use of zero width non-joiner prevents the default formation
of the conjunct form क्ष (K.SSAn).
Figure 12-4. KAd + ZWNJ + SSAl → KAd + SSAn
Explicit Half-Consonants. When a dead consonant participates in forming a conjunct, the
dead consonant form is often absorbed into the conjunct form, such that it is no longer distinctly
visible. In other contexts, the dead consonant may remain visible as a half-consonant
form. In general, a half-consonant form is distinguished from the nominal consonant form
by the loss of its inherent vowel stem, a vertical stem appearing to the right side of the con-
sonant form. In other cases, the vertical stem remains but some part of its right-side geom-
etry is missing.
In certain cases, it is desirable to prevent a dead consonant from assuming full conjunct
formation yet still not appear with an explicit virama. In these cases, the half-form of the
consonant is used. To explicitly encode a half-consonant form, the Unicode Standard
adopts the convention of placing the character U+200D zero width joiner immediately
after the encoded dead consonant. The zero width joiner denotes a nonvisible letter that
presents linking or cursive joining behavior on either side (that is, to the previous or fol-
lowing letter). Therefore, in the present context, the zero width joiner may be consid-
ered to present a context to which a preceding dead consonant may join so as to create the
half-form of the consonant.
For example, if Ch denotes the half-form glyph of consonant C, then a half-consonant form
is represented as shown in Figure 12-5.
Figure 12-5. KAd + ZWJ + SSAl → KAh + SSAn
In the absence of the zero width joiner, the sequence in Figure 12-5 would normally produce
the full conjunct form क्ष (K.SSAn).
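The three encodings of a two-consonant cluster discussed so far differ only in the character placed after the virama. As a minimal sketch, with the function name `cluster` and the form labels being illustrative assumptions:

```python
KA, SSA, VIRAMA = "\u0915", "\u0937", "\u094D"
ZWNJ, ZWJ = "\u200C", "\u200D"

def cluster(first, second, form="conjunct"):
    """Encode a two-consonant cluster in one of three requested forms.

    "conjunct":        dead consonant directly followed by the next consonant
    "explicit_virama": ZWNJ after the dead consonant keeps the virama visible
    "half_form":       ZWJ after the dead consonant requests its half-form
    """
    joiner = {"conjunct": "", "explicit_virama": ZWNJ, "half_form": ZWJ}[form]
    return first + VIRAMA + joiner + second

for form in ("conjunct", "explicit_virama", "half_form"):
    encoded = cluster(KA, SSA, form)
    print(form, [f"U+{ord(ch):04X}" for ch in encoded])
```

How each sequence is actually drawn still depends on the glyphs available in the font.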
This encoding of half-consonant forms also applies in the absence of a base letterform.
That is, this technique may be used to encode independent half-forms, as shown in
Figure 12-6.
Figure 12-6. GAd + ZWJ → GAh
Other Indic scripts have similar half-forms for the initial consonants of a conjunct. Some,
such as Oriya, also have similar half-forms for the final consonants; those are represented
as shown in Figure 12-7.
As the rendering of conjuncts and half-forms depends on the availability of glyphs in the
font, the following fallback strategy should be employed:
• If the coded character sequence would normally render with a full conjunct,
but such a conjunct is not available, the fallback rendering is to use half-forms.
If those are not available, the fallback rendering should use an explicit (visible)
virama.
• If the coded character sequence would normally render with a half-form (it
contains a ZWJ), but half-forms are not available, the fallback rendering should
use an explicit (visible) virama.
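The fallback strategy above can be written as a small decision function. This is a sketch under stated assumptions: font capabilities are reduced to two booleans, and the function name and form labels are hypothetical rather than part of any rendering API.

```python
def choose_rendering(requested, has_conjunct, has_half_form):
    """Pick the form actually used for a dead consonant in a cluster.

    requested is "conjunct" (no joiner), "half_form" (sequence contains ZWJ),
    or "explicit_virama" (sequence contains ZWNJ).
    """
    if requested == "conjunct" and has_conjunct:
        return "conjunct"
    # A missing conjunct falls back to the half-form; a requested half-form
    # is used directly when the font has one.
    if requested in ("conjunct", "half_form") and has_half_form:
        return "half_form"
    # Last resort, and the only choice when an explicit virama was requested.
    return "explicit_virama"

print(choose_rendering("conjunct", has_conjunct=False, has_half_form=True))
```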
Rendering Devanagari
Rules for Rendering. This section provides more formal and detailed rules for minimal
rendering of Devanagari as part of a plain text sequence. It describes the mapping between
Unicode characters and the glyphs in a Devanagari font. It also describes the combining
and ordering of those glyphs.
These rules provide minimal requirements for legibly rendering interchanged Devanagari
text. As with any script, a more complex procedure can add rendering characteristics,
depending on the font and application.
In a font that is capable of rendering Devanagari, the number of glyphs is
greater than the number of Devanagari characters.
Notation. In the next set of rules, the following notation applies:
Cn Nominal glyph form of consonant C as it appears in the code
charts.
Cl A live consonant, depicted identically to Cn.
Cd Glyph depicting the dead consonant form of consonant C.
Ch Glyph depicting the half-consonant form of consonant C.
Ln Nominal glyph form of a conjunct ligature consisting of two or
more component consonants. A conjunct ligature composed of
two consonants X and Y is also denoted X.Yn.
RAsup A nonspacing combining mark glyph form of U+0930 devana-
gari letter ra positioned above or attached to the upper part
of a base glyph form. This form is also known as repha.
RAsub A nonspacing combining mark glyph form of U+0930 devana-
gari letter ra positioned below or attached to the lower part
of a base glyph form.
Vvs Glyph depicting the dependent vowel sign form of a vowel V.
VIRAMAn The nominal glyph form of the nonspacing combining mark
depicting U+094D devanagari sign virama.
A virama character is not always depicted. When it is depicted, it adopts this nonspacing
mark form.
Dead Consonant Rule. The following rule logically precedes the application of any other
rule to form a dead consonant. Once formed, a dead consonant may be subject to other
rules described next.
TAn + VIRAMAn → TAd
त + ◌् → त्
Consonant RA Rules. The character U+0930 devanagari letter ra takes one of a num-
ber of visual forms depending on its context in a consonant cluster. By default, this letter is
depicted with its nominal glyph form (as shown in the code charts). In some contexts, it is
depicted using one of two nonspacing glyph forms that combine with a base letterform.
R2 If the dead consonant RAd precedes a consonant, then it is replaced by the super-
script nonspacing mark RAsup , which is positioned so that it applies to the logically
subsequent element in the memory representation.
RAd + KAl → KAl + RAsup
र् + क → र्क
RAd(1) + RAd(2) → RAd(2) + RAsup(1)
र् + र् → र्र्
R3 If the superscript mark RAsup is to be applied to a dead consonant and that dead
consonant is combined with another consonant to form a conjunct ligature, then
the mark is positioned so that it applies to the conjunct ligature form as a whole.
RAd + JAd + NYAl → J.NYAn + RAsup
र् + ज् + ञ → र्ज्ञ
R4 If the superscript mark RAsup is to be applied to a dead consonant that is subse-
quently replaced by its half-consonant form, then the mark is positioned so that it
applies to the form that serves as the base of the consonant cluster.
RAd + GAd + GHAl → GAh + GHAl + RAsup
र् + ग् + घ → र्ग्घ
R5 In conformance with the ISCII standard, the half-consonant form RRAh is repre-
sented as eyelash-RA. This form of RA is commonly used in writing Marathi and
Newari.
RRAn + VIRAMAn → RRAh
ऱ + ◌् → ऱ्
R5a For compatibility with The Unicode Standard, Version 2.0, if the dead consonant
RAd precedes zero width joiner, then the half-consonant form RAh , depicted as
eyelash-RA, is used instead of RAsup .
RAd + ZWJ → RAh
R6 Except for the dead consonant RAd , when a dead consonant Cd precedes the live
consonant RAl , then Cd is replaced with its nominal form Cn , and RA is replaced by
the subscript nonspacing mark RAsub , which is positioned so that it applies to Cn.
TTHAd + RAl → TTHAn + RAsub
ठ् + र → ठ्र
R7 For certain consonants, the mark RAsub may graphically combine with the conso-
nant to form a conjunct ligature form. These combinations, such as the one shown
here, are further addressed by the ligature rules described shortly.
PHAd + RAl → PHAn + RAsub → PH.RAn
फ् + र → फ्र
R8 If a dead consonant (other than RAd ) precedes RAd , then the substitution of RA for
RAsub is performed as described above; however, the VIRAMA that formed RAd
remains so as to form a dead consonant conjunct form.
Æ + ⁄† → à + ˛ + † → d†
South and Central Asia-I 460 12.1 Devanagari
A dead consonant conjunct form that contains an absorbed RAd may subsequently
combine to form a multipart conjunct form.
T.RAd + YAl → T.R.YAn
त्र् + य → त्र्य
Modifier Mark Rules. In addition to vowel signs, three other types of combining marks
may be applied to a component of an orthographic syllable or to the syllable as a whole:
nukta, bindus, and svaras.
R9 The nukta sign, which modifies a consonant form, is placed immediately after the
consonant in the memory representation and is attached to that consonant in ren-
dering. If the consonant represents a dead consonant, then NUKTA should precede
VIRAMA in the memory representation.
KAn + NUKTAn + VIRAMAn → QAd
क + ◌़ + ◌् → क़्
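Rule R9 fixes the order nukta, then virama. As a brief sketch of how that order can be produced and checked, with the helper names being illustrative assumptions: the canonical combining classes of the two marks (7 for nukta, 9 for virama) also mean that Unicode normalization reorders a swapped pair into the R9 order.

```python
import unicodedata

KA, NUKTA, VIRAMA = "\u0915", "\u093C", "\u094D"

def dead_consonant_with_nukta(consonant):
    """Encode a nukta-modified dead consonant in the order required by R9."""
    return consonant + NUKTA + VIRAMA

def has_r9_order(text):
    """False if any virama is immediately followed by a nukta."""
    return VIRAMA + NUKTA not in text

swapped = KA + VIRAMA + NUKTA
print(has_r9_order(swapped))                                  # False
print(unicodedata.combining(NUKTA), unicodedata.combining(VIRAMA))
print(has_r9_order(unicodedata.normalize("NFC", swapped)))    # True
```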
R10 Other modifying marks, in particular bindus and svaras, apply to the
orthographic syllable as a whole and should follow (in the memory representa-
tion) all other characters that constitute the syllable. The bindus should follow any
vowel signs, and the svaras should come last. The relative placement of these
marks is horizontal rather than vertical; the horizontal rendering order may vary
according to typographic concerns.
KAn + AAvs + CANDRABINDUn
क + ◌ा + ◌ँ → काँ
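The ordering in rule R10 (vowel signs, then bindus, then svaras) can be checked by ranking each mark. The sketch below is illustrative: the function names are assumptions, and the code points chosen for bindus (U+0901, U+0902) and svaras (U+0951, U+0952) are a small assumed subset rather than a complete list.

```python
def mark_rank(ch):
    """Rank a mark by where R10 places it; None for characters not ranked here."""
    cp = ord(ch)
    if 0x093E <= cp <= 0x094C:      # dependent vowel signs (core range)
        return 0
    if cp in (0x0901, 0x0902):      # candrabindu, anusvara (assumed bindu subset)
        return 1
    if cp in (0x0951, 0x0952):      # udatta, anudatta (assumed svara subset)
        return 2
    return None

def follows_r10(syllable):
    """True if the ranked marks in the syllable appear in R10 order."""
    ranks = [r for r in map(mark_rank, syllable) if r is not None]
    return ranks == sorted(ranks)

print(follows_r10("\u0915\u093E\u0901"))  # KA + AA + CANDRABINDU
print(follows_r10("\u0915\u0901\u093E"))  # candrabindu before the vowel sign
```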
Ligature Rules. Subsequent to the application of the rules just described, a set of rules gov-
erning ligature formation apply. The precise application of these rules depends on the
availability of glyphs in the current font being used to display the text.
R11 If a dead consonant immediately precedes another dead consonant or a live conso-
nant, then the first dead consonant may join the subsequent element to form a
two-part conjunct ligature form.
JAd + NYAl → J.NYAn                TTAd + TTHAl → TT.TTHAn
ज् + ञ → ज्ञ                         ट् + ठ → ट्ठ
R12 A conjunct ligature form can itself behave as a dead consonant and enter into fur-
ther, more complex ligatures.
SAd + TAd + RAn → SAd + T.RAn → S.T.RAn
स् + त् + र → स् + त्र → स्त्र
A conjunct ligature form can also produce a half-form.
K.SSAd + YAl → K.SSh + YAn
क्ष् + य → क्ष्य
R13 If a nominal consonant or conjunct ligature form precedes RAsub as a result of the
application of rule R6, then the consonant or ligature form may join with RAsub to
form a multipart conjunct ligature (see rule R6 for more information).
KAn + RAsub → K.RAn                PHAn + RAsub → PH.RAn
क + ◌्र → क्र                        फ + ◌्र → फ्र
R14 In some cases, other combining marks will combine with a base consonant, either
attaching at a nonstandard location or changing shape. In minimal rendering,
there are only two cases: RAl with Uvs or UUvs .
RAl + Uvs → RUn                    RAl + UUvs → RUUn
र + ◌ु → रु                          र + ◌ू → रू
Memory Representation and Rendering Order. The storage of plain text in Devanagari
and all other Indic scripts generally follows phonetic order; that is, a CV syllable with a
dependent vowel is always encoded as a consonant letter C followed by a vowel sign V in
the memory representation. This order is employed by the ISCII standard and corresponds
to both the phonetic order and the keying order of textual data (see Figure 12-9).
Figure 12-9. KAn + Ivs → Ivs + KAn (क + ◌ि → कि)
Because Devanagari and other Indic scripts have some dependent vowels that must be
depicted to the left side of their consonant letter, the software that renders the Indic scripts
must be able to reorder elements in mapping from the logical (character) store to the pre-
sentational (glyph) rendering. For example, if Cn denotes the nominal form of consonant
C, and Vvs denotes a left-side dependent vowel sign form of vowel V, then a reordering of
glyphs with respect to encoded characters occurs as just shown.
R15 When the dependent vowel Ivs is used to override the inherent vowel of a syllable, it
is always written to the extreme left of the orthographic syllable. If the
orthographic syllable contains a consonant cluster, then this vowel is always
depicted to the left of that cluster.
त् + र + ि → त्र + ि → त्रि
R16 The presence of an explicit virama (either caused by a ZWNJ or by the absence of a
conjunct in the font) blocks this reordering, and the dependent vowel Ivs is ren-
dered after the rightmost such explicit virama.
त् + ZWNJ + र + ि → त्‌रि
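The reordering described in rules R15 and R16 can be sketched in Python. This is a minimal illustration under stated assumptions, not a shaping engine: the function name visual_order is hypothetical, only U+093F devanagari vowel sign i is handled, and an explicit virama is approximated as a virama immediately followed by ZWNJ. A virama that stays visible because the font lacks a conjunct cannot be detected from the text alone.

```python
VIRAMA = "\u094D"        # DEVANAGARI SIGN VIRAMA
ZWNJ = "\u200C"          # ZERO WIDTH NON-JOINER
VOWEL_SIGN_I = "\u093F"  # DEVANAGARI VOWEL SIGN I, drawn to the left


def visual_order(syllable: str) -> str:
    """Return one orthographic syllable in display order (hypothetical helper).

    The input is in logical order: consonant cluster first, vowel sign last.
    """
    if not syllable.endswith(VOWEL_SIGN_I):
        return syllable  # nothing to reorder in this sketch
    cluster = syllable[:-1]
    # R16: the vowel sign does not move left past the rightmost explicit virama
    # (assumption: explicit virama == VIRAMA followed by ZWNJ).
    marker = VIRAMA + ZWNJ
    cut = cluster.rfind(marker)
    split = 0 if cut == -1 else cut + len(marker)
    # R15: otherwise it is placed at the extreme left of the cluster.
    return cluster[:split] + VOWEL_SIGN_I + cluster[split:]
```

For <TA, virama, RA, I> the vowel sign moves in front of the whole cluster; for <TA, virama, ZWNJ, RA, I> it stays to the right of the explicit virama and precedes only RA.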
Alternative Forms of Cluster-Initial RA. In addition to reph (rule R2) and eyelash (rule
R5a), a cluster-initial RA may also take its nominal form while the following consonant
takes a reduced form. This behavior is required by languages that make a morphological
distinction between “reph on YA” and “RA with reduced YA”, such as Braj Bhasha. To trig-
ger this behavior, a ZWJ is placed immediately before the virama to request a reduced form
of the following consonant, while preventing the formation of reph, as shown in the third
example below.
$्
र य
$्
र य
$्
र य र
Similar, special rendering behavior of cluster-initial RA is noted in other scripts of India.
See, for example, “Interaction of Repha and Ya-phalaa” in Section 12.2, Bengali (Bangla),
“Reph” in Section 12.7, Telugu, and “Consonant Clusters Involving RA” in Section 12.8,
Kannada.
Sample Half-Forms. Table 12-4 shows examples of half-consonant forms that are com-
monly used with the Devanagari script. These forms are glyphs, not characters. They may
be encoded explicitly using zero width joiner as shown. In normal conjunct formation,
they may be used spontaneously to depict a dead consonant in combination with subse-
quent consonant forms.
क + ◌् + ZWJ → क्‍        न + ◌् + ZWJ → न्‍
ख + ◌् + ZWJ → ख्‍        प + ◌् + ZWJ → प्‍
ग + ◌् + ZWJ → ग्‍        फ + ◌् + ZWJ → फ्‍
घ + ◌् + ZWJ → घ्‍        ब + ◌् + ZWJ → ब्‍
च + ◌् + ZWJ → च्‍        भ + ◌् + ZWJ → भ्‍
ज + ◌् + ZWJ → ज्‍        म + ◌् + ZWJ → म्‍
झ + ◌् + ZWJ → झ्‍        य + ◌् + ZWJ → य्‍
ञ + ◌् + ZWJ → ञ्‍        ल + ◌् + ZWJ → ल्‍
ण + ◌् + ZWJ → ण्‍        व + ◌् + ZWJ → व्‍
त + ◌् + ZWJ → त्‍        श + ◌् + ZWJ → श्‍
थ + ◌् + ZWJ → थ्‍        ष + ◌् + ZWJ → ष्‍
ध + ◌् + ZWJ → ध्‍        स + ◌् + ZWJ → स्‍
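A short Python sketch of the explicit encoding just described: a half-form is requested by following the consonant with virama and zero width joiner, while an ordinary conjunct uses virama alone and leaves the choice of glyph to the font. The helper names half_form and conjunct are hypothetical.

```python
VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER


def half_form(consonant: str) -> str:
    """Sequence that explicitly requests the half-form of a consonant."""
    return consonant + VIRAMA + ZWJ


def conjunct(*consonants: str) -> str:
    """Plain conjunct sequence; the font decides between ligature and half-forms."""
    return VIRAMA.join(consonants)
```

Whether a half-form glyph is actually displayed still depends on the font; the sequence only expresses the request.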
Sample Ligatures. Table 12-5 shows examples of conjunct ligature forms that are com-
monly used with the Devanagari script. These forms are glyphs, not characters. Not every
writing system that employs this script uses all of these forms; in particular, many of these
forms are used only in writing Sanskrit texts. Furthermore, individual fonts may provide
fewer or more ligature forms than are depicted here.
क + ◌् + क → क्क        ट + ◌् + ठ → ट्ठ
क + ◌् + त → क्त        ठ + ◌् + ठ → ठ्ठ
क + ◌् + र → क्र        ड + ◌् + ग → ड्ग
क + ◌् + ष → क्ष        ड + ◌् + ड → ड्ड
ङ + ◌् + क → ङ्क        ड + ◌् + ढ → ड्ढ
ङ + ◌् + ख → ङ्ख        त + ◌् + त → त्त
ङ + ◌् + ग → ङ्ग        त + ◌् + र → त्र
ङ + ◌् + घ → ङ्घ        न + ◌् + न → न्न
ञ + ◌् + ज → ञ्ज        फ + ◌् + र → फ्र
ज + ◌् + ञ → ज्ञ        श + ◌् + र → श्र
द + ◌् + घ → द्घ        ह + ◌् + म → ह्म
द + ◌् + द → द्द        ह + ◌् + य → ह्य
द + ◌् + ध → द्ध        ह + ◌् + ल → ह्ल
द + ◌् + ब → द्ब        ह + ◌् + व → ह्व
द + ◌् + भ → द्भ        ह + ◌ृ → हृ
द + ◌् + म → द्म        र + ◌ु → रु
द + ◌् + य → द्य        र + ◌ू → रू
द + ◌् + व → द्व        स + ◌् + त्र → स्त्र
ट + ◌् + ट → ट्ट
r + a → i or b
r + c → j or d
r + e → k or f
r + g → m or h
The graphical forms displayed above with the reph (RAsup) should not be represented by
sequences of RA + virama + independent vowel, as such sequences violate the general
encoding principles of the script. CV orthographic syllables are not represented by conso-
nant + virama + independent vowel.
The practice of writing these phonological sequences as a reph on an independent vocalic
liquid letter is also observed in other Indic scripts, such as Bengali, Gujarati, Oriya, Telugu,
Kannada, and Bhaiksuki.
Sample Half-Ligature Forms. In addition to half-form glyphs of individual consonants,
half-forms are used to depict conjunct ligature forms. A sample of such forms is shown in
Table 12-7. These forms are glyphs, not characters. They may be encoded explicitly using
zero width joiner as shown. In normal conjunct formation, they may be used sponta-
neously to depict a conjunct ligature in combination with subsequent consonant forms.
क + ◌् + ष + ◌् + ZWJ → क्ष्‍
ज + ◌् + ञ + ◌् + ZWJ → ज्ञ्‍
त + ◌् + त + ◌् + ZWJ → त्त्‍
त + ◌् + र + ◌् + ZWJ → त्र्‍
श + ◌् + र + ◌् + ZWJ → श्र्‍
gali, Gujarati, and so on. However, analogous punctuation marks for other Brahmi-derived
scripts are separately encoded, particularly for scripts used primarily outside of India.
Many modern languages written in the Devanagari script intersperse punctuation derived
from the Latin script. Thus U+002C comma and U+002E full stop are freely used in writ-
ing Hindi, and the danda is usually restricted to more traditional texts. However, the
danda may be preserved when such traditional texts are transliterated into the Latin script.
Other Symbols. U+0970 ॰ devanagari abbreviation sign appears after letters or combinations
of letters and marks the sequence as an abbreviation. It is intended specifically for
Devanagari script-based abbreviations, such as the Devanagari rupee sign. Other symbols
and signs most commonly occurring in Vedic texts are encoded in the Devanagari
Extended and Vedic Extensions blocks and are discussed in the text that follows.
The svasti (or well-being) signs often associated with the Hindu, Buddhist, and Jain tradi-
tions are encoded in the Tibetan block. See Section 13.4, Tibetan for further information.
Example Meaning
तला sole
तलाऽ pond
Letters for Bihari Languages. A number of the Devanagari vowel letters have been used to
write the Bihari languages Bhojpuri, Magadhi, and Maithili, as listed in Table 12-9.
Letter Short a. The character U+0904 devanagari letter short a is used to denote a
short e in the Awadhi language, an Indo-Aryan language spoken in the north Indian state of
Uttar Pradesh and southern Nepal. A publisher in Lucknow, Uttar Pradesh also uses it in
Hindi translations and Devanagari transliterations of the Kannada, Telugu, Tamil, Malay-
alam and Kashmiri languages.
Prishthamatra Orthography. In the historic Prishthamatra orthography, the vowel signs
for e, ai, o, and au are represented using U+094E devanagari vowel sign prishthama-
tra e (which goes on the left side of the consonant) alone or in combination with one of
U+0947 devanagari vowel sign e, U+093E devanagari vowel sign aa or U+094B
devanagari vowel sign o. Table 12-10 shows those combinations applied to ka. In the
underlying representation of text, U+094E should be first in the sequence of dependent
vowel signs after the consonant, and may be followed by U+0947, U+093E or U+094B.
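The ordering constraint stated above can be checked mechanically. The following Python sketch is illustrative only: the function name prishthamatra_sequence_ok is hypothetical, and it inspects just the run of dependent vowel signs that follows a consonant, without deciding which vowel a given combination spells.

```python
PRISHTHAMATRA_E = "\u094E"  # DEVANAGARI VOWEL SIGN PRISHTHAMATRA E
# Vowel signs that may follow U+094E, as listed in the text above.
ALLOWED_FOLLOWERS = {"\u0947", "\u093E", "\u094B"}


def prishthamatra_sequence_ok(vowel_signs: str) -> bool:
    """Check the dependent vowel signs that follow one consonant."""
    if PRISHTHAMATRA_E not in vowel_signs:
        return True  # the constraint does not apply
    if vowel_signs[0] != PRISHTHAMATRA_E:
        return False  # U+094E must come first in the sequence
    rest = vowel_signs[1:]
    # U+094E alone, or followed by exactly one of the allowed vowel signs.
    return rest == "" or rest in ALLOWED_FOLLOWERS
```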
Cantillation Marks for the Sāmaveda. One of the four major Vedic texts is Sāmaveda. The
text is both recited (Sāmaveda-Saṃhitā) and sung (Sāmagāna), and is marked differently
for the purposes of each. Cantillation marks are used to indicate length, tone, and other
features in the recited text of Sāmaveda, and in the Kauthuma and Rāṇāyanīya traditions of
Sāmagāna. These marks are encoded as a series of combining digits, alphabetic characters,
and avagraha in the range U+A8E0..U+A8F1. The marks are rendered directly over the
base letter. They are represented in text immediately after the syllable they modify.
In certain cases, two marks may occur over a letter: U+A8E3 combining devanagari
digit three may be followed by U+A8EC combining devanagari letter ka, for example.
Although no use of U+A8E8 combining devanagari digit eight has been found in
the Sāmagāna, it is included to provide a complete set of 0–9 digits. The combining marks
encoded for the Sāmaveda do not include characters that may appear as subscripts and
superscripts in the Jaiminīya tradition of Sāmagāna, which used interlinear annotation.
Interlinear annotation may be rendered using Ruby and may be represented by means of
markup or other higher-level protocols.
Nasalization Marks. The spacing marks in the range U+A8F2..U+A8F7 include the
term candrabindu in their names and indicate nasalization. These marks are all aligned
with the headline. Note that U+A8F2 devanagari sign spacing candrabindu is lower
than the U+0901 devanagari sign candrabindu.
Editorial Marks. A set of editorial marks is encoded in the range U+A8F8..U+A8FB for use
with Devanagari. U+A8F9 devanagari gap filler signifies an intentional gap that would
ordinarily be filled with text. In contrast, U+A8FB devanagari headstroke indicates
illegible gaps in the original text. The glyph for devanagari headstroke should be
designed so that it does not connect to the headstroke of the letters beside it, which will
make it possible to indicate the number of illegible syllables in a given space. U+A8F8
devanagari sign pushpika acts as a filler in text, and is commonly flanked by double dan-
das. U+A8FA devanagari caret, a zero-width spacing character, marks the insertion
point of omitted text, and is placed at the insertion point between two orthographic sylla-
bles. It can also be used to indicate word division.
which a pause is disallowed. The block also contains several Vedic signs for ardhavisarga,
jihvamuliya, upadhmaniya and atikrama.
Tone Marks. The Vedic tone marks are all combining marks. The tone marks are grouped
together in the code charts based upon the tradition in which they appear: they are used in
the four core texts of the Vedas (Sāmaveda, Yajurveda, Rigveda, and Atharvaveda) and in
the prose text on Vedic ritual (Śatapathabrāhmaṇa). The character U+1CD8 vedic tone
candra below is also used to identify the short vowels e and o. In this usage, the pre-
scribed order is the Indic syllable (aksara), followed by U+1CD8 vedic tone candra
below and the tone mark (svara). When a tone mark is placed below, it appears below the
vedic tone candra below.
In addition to the marks encoded in this block, Vedic texts may use other nonspacing
marks from the General Diacritics block and other blocks. For example, U+20F0 combin-
ing asterisk above would be used to represent a mark of that shape above a Vedic letter.
Diacritics for the Visarga. A set of combining marks that serve as diacritics for the visarga
is encoded in the range U+1CE2..U+1CE8. These marks indicate that the visarga has a par-
ticular tone. For example, the combination U+0903 devanagari sign visarga plus
U+1CE2 vedic sign visarga svarita represents a svarita visarga. The upward-shaped
diacritic is used for the udātta (high-toned), the downward-shaped diacritic for anudātta
(low-toned), and the midline glyph indicates the svarita (modulated tone).
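As a small illustration of the combination described above, the following Python sketch attaches one of the visarga diacritics to U+0903. The helper name toned_visarga is hypothetical; the range check simply mirrors the range U+1CE2..U+1CE8 given in the text.

```python
import unicodedata

VISARGA = "\u0903"          # DEVANAGARI SIGN VISARGA
VISARGA_SVARITA = "\u1CE2"  # VEDIC SIGN VISARGA SVARITA


def toned_visarga(diacritic: str) -> str:
    """Return visarga followed by one diacritic from U+1CE2..U+1CE8."""
    if len(diacritic) != 1 or not 0x1CE2 <= ord(diacritic) <= 0x1CE8:
        raise ValueError("expected a visarga diacritic in U+1CE2..U+1CE8")
    return VISARGA + diacritic


if __name__ == "__main__":
    for ch in toned_visarga(VISARGA_SVARITA):
        # Print each code point with its character name from the UCD.
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Because the diacritic is a separate character, markup can target it on its own, for example to apply color.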
In Vedic manuscripts the tonal mark (that is, the horizontal bar, upward curve and down-
ward curve) appears in colored ink, while the two dots of the visarga appear in black ink.
The characters for accents can be represented using separate characters, to make it easier
for color information to be maintained by means of markup or other higher-level proto-
cols.
Nasalization Marks. A set of spacing marks and one combining mark, U+1CED vedic
sign tiryak, are encoded in the range U+1CE9..U+1CF1. They describe phonetic distinc-
tions in the articulation of nasals. The gomukha characters from U+1CE9..U+1CEC may
be combined with U+0902 devanagari sign anusvara or U+0901 devanagari sign
candrabindu. U+1CF1 vedic sign anusvara ubhayato mukha may indicate a visarga
with a tonal mark as well as a nasal. The three characters, U+1CEE vedic sign hexiform
long anusvara, U+1CEF vedic sign long anusvara, and U+1CF0 vedic sign rthang
long anusvara, are all synonymous and indicate a long anusvāra after a short vowel.
U+1CED vedic sign tiryak is the only combining character in this set of nasalization
marks. While it appears similar to the U+094D devanagari sign virama, it is used to ren-
der glyph variants of nasal marks that occur in manuscripts and printed texts.
Ardhavisarga. U+1CF2 vedic sign ardhavisarga is a character that marks either the
jihvāmūlīya, a velar fricative occurring only before the unvoiced velar stops ka and kha, or the
upadhmānīya, a bilabial fricative occurring only before the unvoiced labial stops pa and
pha. Ardhavisarga is a spacing character. It is represented in text in visual order before the
consonant it modifies.
South and Central Asia-I 472 12.2 Bengali (Bangla)
There is an exception to this general pattern for the representation of Bengali independent
vowel letters, for the Bengali script orthography of Kokborok, a major language of Tripura
state in Northeast India. Kokborok has diphthongs which can occur as initial letters. To
reflect existing practice, these diphthongs are represented with two character sequences,
rather than as atomic characters, as shown in Table 12-12. Rendering systems which sup-
port display of the Kokborok orthography need to be aware of these exceptional sequences.
The sequence for vowel letter aw uses U+09D7 bengali au length mark, also noted in
the following discussion of two-part vowel signs.
Two-Part Vowel Signs. The Bengali script, along with a number of other Indic scripts,
makes use of two-part dependent vowel signs. In these dependent vowels (matras) one-half
of the vowel is displayed on each side of a consonant letter or cluster—for example,
U+09CB bengali vowel sign o and U+09CC bengali vowel sign au. To provide com-
patibility with existing implementations of the scripts that use two-part vowel signs, the
Unicode Standard explicitly encodes the right half of these vowel signs. For example,
U+09D7 bengali au length mark represents the right-half glyph component of
U+09CC bengali vowel sign au. In Bengali orthography, the au length mark is always
used in conjunction with the left part and does not have a meaning on its own.
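The relationship between a two-part vowel sign and its encoded right half can be observed through canonical normalization. The Python sketch below relies on the canonical decomposition recorded in the Unicode Character Database, in which U+09CC decomposes into U+09C7 bengali vowel sign e plus U+09D7; identifying the left part as U+09C7 comes from that data rather than from the text above. The function names are hypothetical.

```python
import unicodedata

VOWEL_SIGN_E = "\u09C7"    # left part (per the UCD decomposition)
VOWEL_SIGN_AU = "\u09CC"   # two-part vowel sign
AU_LENGTH_MARK = "\u09D7"  # right part


def split_two_part(text: str) -> str:
    """Decompose two-part vowel signs into left and right parts (NFD)."""
    return unicodedata.normalize("NFD", text)


def join_two_part(text: str) -> str:
    """Recompose the parts into the single vowel sign (NFC)."""
    return unicodedata.normalize("NFC", text)
```

A process that compares or searches Bengali text would normally normalize both sides first, so that the one-character and two-character spellings match.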
Special Characters. U+09F2..U+09F9 are a series of Bengali additions for writing currency
and fractions.
Historic Characters. The characters vocalic rr, vocalic l and vocalic ll, both in their inde-
pendent and dependent forms (U+098C, U+09C4, U+09E0..U+09E3), are only used to
write Sanskrit words in the Bengali script.
Characters for Assamese. Assamese employs two letters not used for the Bengali language.
The Assamese letter ra is represented in Unicode by U+09F0 ৰ bengali letter ra with
middle diagonal, and the Assamese letter wa is represented by U+09F1 ৱ bengali let-
ter ra with lower diagonal.
Assamese uses a conjunct character called kssa. Although kssa is often considered a sepa-
rate letter of the alphabet, it is not separately encoded. The conjunct is represented by the
sequence <U+0995 ক bengali letter ka, U+09CD ◌্ bengali sign virama, U+09B7 ষ
bengali letter ssa>. This same sequence is also used to represent the Bengali letter
khinya (or khiya).
Assamese uses two additional consonant-vowel ligatures formed with U+09F0 bengali
letter ra with middle diagonal, which are not used for the Bengali language. These
consonant-vowel ligatures are shown in the “ligated” column in Table 12-13.
Rendering Behavior. Like other Brahmic scripts in the Unicode Standard, Bengali uses the
hasant to form conjunct characters. For example, U+09B8 স bengali letter sa +
U+09CD ◌্ bengali sign virama + U+0995 ক bengali letter ka yields the conjunct স্ক
SKA. For general principles regarding the rendering of the Bengali script, see the rules for
rendering in Section 12.1, Devanagari.
Consonant-Vowel Ligatures. Some Bengali consonant plus vowel combinations have two
distinct visual presentations. The first visual presentation is a traditional ligated form, in
which the vowel combines with the consonant in a novel way. In the second presentation,
the vowel is joined to the consonant but retains its nominal form, and the combination is
not considered a ligature. These consonant-vowel combinations are illustrated in
Table 12-14.
The ligature forms of these consonant-vowel combinations are traditional. They are used
in handwriting and some printing. The “non-ligated” forms are more common; they are
used in newspapers and are associated with modern typefaces. However, the traditional lig-
atures are preferred in some contexts.
No semantic distinctions are made in Bengali text on the basis of the two different presen-
tations of these consonant-vowel combinations. However, some users consider it import-
ant that implementations support both forms and that the distinction be representable in
plain text. This may be accomplished by using U+200D zero width joiner and U+200C
zero width non-joiner to influence ligature glyph selection. (See “Cursive Connection
and Ligatures” in Section 23.2, Layout Controls.) Joiners are rarely needed in this situation.
The rendered appearance will typically be the result of a font choice.
A given font implementation can choose whether to treat the ligature forms of the conso-
nant-vowel combinations as the defaults for rendering. If the non-ligated form is the
default, then ZWJ can be inserted to request a ligature, as shown in Figure 12-12.
গ + ◌ু → গু
0997 09C1 ga + u
গ + ZWJ + ◌ু → গ‍ু
0997 200D 09C1 ga + u ligature
If the ligated form is the default for a given font implementation, then ZWNJ can be
inserted to block a ligature, as shown in Figure 12-13.
গ + ◌ু → গু
0997 09C1 ga + u ligature
গ + ZWNJ + ◌ু → গ‌ু
0997 200C 09C1 ga + u
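The two joiner conventions can be sketched in Python as follows. The helper names request_ligature and block_ligature are hypothetical, and the sequences only express a preference: whether a ligated glyph appears still depends on the font and rendering system.

```python
ZWJ = "\u200D"   # ZERO WIDTH JOINER: request the ligated form
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER: block the ligated form


def request_ligature(consonant: str, vowel_sign: str) -> str:
    """Consonant-vowel sequence asking for the traditional ligature."""
    return consonant + ZWJ + vowel_sign


def block_ligature(consonant: str, vowel_sign: str) -> str:
    """Consonant-vowel sequence asking for the non-ligated form."""
    return consonant + ZWNJ + vowel_sign
```

For ga + u, these produce <0997, 200D, 09C1> and <0997, 200C, 09C1> respectively.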
Khiya. The letter ক্ষ, known as khiya or khinya, is often considered a distinct letter of the
Bangla alphabet. However, it is not encoded separately. It is represented by the sequence
<U+0995 ক bengali letter ka, U+09CD ◌্ bengali sign virama, U+09B7 ষ bengali
letter ssa>.
Khanda Ta. In Bengali, a dead consonant ta makes use of a special form, U+09CE bengali
letter khanda ta. This form is used in all contexts except where it is immediately fol-
lowed by one of the consonants: ta, tha, na, ba, ma, ya, or ra.
Khanda ta cannot bear a vowel matra or combine with a following consonant to form a
conjunct aksara. It can form a conjunct aksara only with a preceding dead consonant ra,
with the latter being displayed with a repha glyph placed on the khanda ta.
Versions of the Unicode Standard prior to Version 4.1 recommended that khanda ta be
represented as the sequence <U+09A4 bengali letter ta, U+09CD bengali sign
virama, U+200D zero width joiner> in all circumstances. U+09CE bengali letter
khanda ta should instead be used explicitly in newly generated text, but users are cau-
tioned that instances of the older representation may exist.
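Text that still uses the older representation can be migrated with a simple substitution. The Python sketch below is a minimal illustration with a hypothetical function name; it assumes that every occurrence of <ta, virama, ZWJ> in the input was intended as khanda ta, which a real migration would want to confirm for its data.

```python
LEGACY_KHANDA_TA = "\u09A4\u09CD\u200D"  # ta + virama + ZWJ (pre-4.1 practice)
KHANDA_TA = "\u09CE"                     # BENGALI LETTER KHANDA TA


def modernize_khanda_ta(text: str) -> str:
    """Replace the legacy three-character sequence with U+09CE."""
    return text.replace(LEGACY_KHANDA_TA, KHANDA_TA)
```

Searching and matching code may prefer to apply the same substitution to both the query and the text, so that old and new documents compare equal.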
The Bengali syllable tta illustrates the usage of khanda ta when followed by ta. The syllable
tta is normally represented with the sequence <U+09A4 ta, U+09CD hasant, U+09A4 ta>.
That sequence will normally be displayed using a single glyph tta ligature, as shown in the
first example in Figure 12-14.
ত + ◌্ + ত → ত্ত
09A4 09CD 09A4 ta-ta ligature
ত + ◌্ + ZWNJ + ত → ত্‌ত
09A4 09CD 200C 09A4 ta hasant ta
ৎ + ত → ৎত
09CE 09A4 khanda-ta ta
It is also possible for the sequence <ta, hasant, ta> to be displayed with a full ta glyph
combined with a hasant glyph, followed by another full ta glyph (ত্‌ত). The choice of form
actually displayed depends on the display engine, based on the availability of glyphs in the font.
The Unicode Standard also provides an explicit way to show the hasant glyph. To do so, a
zero width non-joiner is inserted after the hasant. That sequence is always displayed
with the explicit hasant, as shown in the second example in Figure 12-14.
When the syllable tta is written with a khanda ta, however, the character U+09CE bengali
letter khanda ta is used and no hasant is required, as khanda ta is already a dead conso-
nant. The rendering of khanda ta is illustrated in the third example in Figure 12-14.
Ya-phalaa. Ya-phalaa is a presentation form of U+09AF য bengali letter ya. Represented
by the sequence <U+09CD ◌্ bengali sign virama, U+09AF য bengali letter
ya>, ya-phalaa has a special form ◌্য. When combined with U+09BE ◌া bengali vowel
sign aa, it is used for transcribing [æ] as in the “a” in the English word “bat.” The ya-pha-
laa appears in WXYZ [ræt] “rash,” which provides a minimal pair with WYZ [rat] “a whole
lot.”
Ya-phalaa can be applied to initial vowels as well:
অ্যা = <0985, 09CD, 09AF, 09BE> (a- hasant ya -aa)
এ্যা = <098F, 09CD, 09AF, 09BE> (e- hasant ya -aa)
If a candrabindu or other combining mark needs to be added in the sequence, it comes at
the end of the sequence. For example:
অ্যাঁ = <0985, 09CD, 09AF, 09BE, 0981> (a- hasant ya -aa candrabindu)
Further examples:
অ + ◌্ + য + ◌া → অ্যা
এ + ◌্ + য + ◌া → এ্যা
ত + ◌্ + য + ◌া → ত্যা
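The ordering used in these sequences can be captured in a small Python helper. The name with_ya_phalaa and its parameters are hypothetical; the sketch only fixes the order base, hasant, ya, vowel sign, and then any further combining mark such as candrabindu.

```python
HASANT = "\u09CD"  # BENGALI SIGN VIRAMA
YA = "\u09AF"      # BENGALI LETTER YA


def with_ya_phalaa(base: str, vowel_sign: str = "", marks: str = "") -> str:
    """Build <base, hasant, ya, vowel sign, other marks> in that order."""
    return base + HASANT + YA + vowel_sign + marks
```

Calling with_ya_phalaa("\u0985", "\u09BE", "\u0981") yields the sequence <0985, 09CD, 09AF, 09BE, 0981> listed above.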
Interaction of Repha and Ya-phalaa. The formation of the repha form is defined in
Section 12.1, Devanagari, “Rules for Rendering,” R2. Basically, the repha is formed when a
ra that has the inherent vowel killed by the hasant begins a syllable. This scenario is shown
in the following example:
র + ◌্ + ম → র্ম as in কর্ম (karma)
The ya-phalaa is a post-base form of ya and is formed when the ya is the final consonant of
a syllable cluster. In this case, the previous consonant retains its base shape and the hasant
is combined with the following ya. This scenario is shown in the following example:
ক + ◌্ + য → ক্য as in বাক্য (bakyô)
র + ZWJ + ◌্ + য → র‍্য
09B0 200D 09CD 09AF
When the first character of the cluster is not a ra, the ya-phalaa is the normal rendering of
a ya, and a ZWJ is not necessary but can be present. Such a convention would make it pos-
sible, for example, for input methods to consistently associate ya-phalaa with the sequence
<ZWJ, hasant, ya>.
Jihvamuliya and Upadhmaniya. In Bengali, the voiceless velar and bilabial fricatives are
represented by U+1CF5 ᳵ vedic sign jihvamuliya and U+1CF6 ᳶ vedic sign upadhmaniya,
respectively. When the signs appear with a following homorganic voiceless stop
consonant, they can be rendered in a font as a stacked ligature without a virama:
ᳵ + ক → ᳵক
ᳶ + প → ᳶপ
The sequences can also be represented linearly by inserting a U+200C zero width non-
joiner after the jihvamuliya or upadhmaniya, but before the following consonant:
ᳵ + ZWNJ + ক → ᳵ‌ক
ᳶ + ZWNJ + প → ᳶ‌প
Dependent vowel signs can also be added to the stack or linear sequence. Consonant clus-
ters containing U+1CF5 vedic sign jihvamuliya and U+1CF6 vedic sign upadhmaniya
can occur with more than two consonants, such as ẖkra and ḫpra.
Punctuation. Bengali uses punctuation marks shared across many Indic scripts, including
the danda and double danda marks. In Bangla these are called the dahri and double dahri.
For a description of these common punctuation marks, see Section 12.1, Devanagari.
Truncation. The orthography of the Bangla language makes use of U+02BC “ ’ ” modifier
letter apostrophe to indicate the truncation of words. This sign is called urdha-comma.
Examples illustrating the use of U+02BC “ ’ ” modifier letter apostrophe are shown in
Table 12-15.
X
Y } above
12.3 Gurmukhi
Gurmukhi: U+0A00–U+0A7F
The Gurmukhi script is a North Indian script used to write the Punjabi (or Panjabi) lan-
guage of the Punjab state of India. Gurmukhi, which literally means “proceeding from the
mouth of the Guru,” is attributed to Angad, the second Sikh Guru (1504–1552 ce). It is
derived from an older script called Landa and is closely related to Devanagari structurally.
The script is closely associated with Sikhs and Sikhism, but it is used on an everyday basis
in East Punjab. (West Punjab, now in Pakistan, uses the Arabic script.)
Encoding Principles. The Gurmukhi block is based on ISCII-1988, which makes it parallel
to Devanagari. Gurmukhi, however, has a number of peculiarities described here.
The additional consonants (called pairin bindi; literally, “with a dot in the foot,” in Pun-
jabi) are primarily used to differentiate Urdu or Persian loan words. They include U+0A36
gurmukhi letter sha and U+0A33 gurmukhi letter lla, but do not include U+0A5C
gurmukhi letter rra, which is genuinely Punjabi. For unification with the other scripts,
ISCII-1991 considers rra to be equivalent to dda+nukta, but this decomposition is not
considered in Unicode. At the same time, ISCII-1991 does not consider U+0A36 to be
equivalent to <0A38, 0A3C>, or U+0A33 to be equivalent to <0A32, 0A3C>.
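The Unicode side of this comparison can be inspected through the canonical decomposition field of the Unicode Character Database. The Python sketch below uses the standard unicodedata module; the helper name canonical_decomposition is hypothetical.

```python
import unicodedata


def canonical_decomposition(ch: str) -> list:
    """Return the canonical decomposition of ch as code points, or []."""
    field = unicodedata.decomposition(ch)
    if not field or field.startswith("<"):
        return []  # no decomposition, or a compatibility decomposition
    return [int(cp, 16) for cp in field.split()]


if __name__ == "__main__":
    for cp in (0x0A36, 0x0A33, 0x0A5C):
        parts = canonical_decomposition(chr(cp))
        shown = " ".join(f"{p:04X}" for p in parts) or "(none)"
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {shown}")
```

In the character database, U+0A36 and U+0A33 carry the decompositions <0A38, 0A3C> and <0A32, 0A3C>, while U+0A5C has none, which is consistent with the statement that the rra decomposition is not considered in Unicode.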
Two different marks can be associated with U+0902 devanagari sign anusvara:
U+0A02 gurmukhi sign bindi and U+0A70 gurmukhi tippi. Present practice is to use
bindi only with the dependent and independent forms of the vowels aa, ii, ee, ai, oo, and au,
and with the independent vowels u and uu; tippi is used in the other contexts. Older texts
may depart from this requirement. ISCII-1991 uses only one encoding point for both
marks.
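The present-day distribution of bindi and tippi can be expressed as a lookup on the preceding character. The Python sketch below is illustrative only: the function name nasal_mark_after is hypothetical, it encodes just the modern practice described above, and older texts may not follow it.

```python
BINDI = "\u0A02"  # GURMUKHI SIGN BINDI
TIPPI = "\u0A70"  # GURMUKHI TIPPI

# Characters after which present practice uses bindi.
BINDI_CONTEXTS = frozenset(
    "\u0A06\u0A08\u0A0F\u0A10\u0A13\u0A14"  # independent aa, ii, ee, ai, oo, au
    "\u0A3E\u0A40\u0A47\u0A48\u0A4B\u0A4C"  # dependent aa, ii, ee, ai, oo, au
    "\u0A09\u0A0A"                          # independent u, uu
)


def nasal_mark_after(preceding: str) -> str:
    """Choose bindi or tippi for the character that precedes the mark."""
    return BINDI if preceding in BINDI_CONTEXTS else TIPPI
```

A converter from ISCII-1991, which has a single code for both marks, would need a rule of this kind to pick the Unicode character.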
U+0A71 gurmukhi addak is a special sign to indicate that the following consonant is
geminate. ISCII-1991 does not have a specific code point for addak and encodes it as a
cluster. For example, the word ਪੱਗ pagg, “turban,” can be represented with the sequence
<0A2A, 0A71, 0A17> (or <pa, addak, ga>) in Unicode, while in ISCII-1991 it would be <pa,
ga, virama, ga>.
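The difference between the two representations of a geminate can be sketched as a conversion. The Python function below is a hypothetical helper that rewrites a cluster of two identical consonants joined by virama, expressed in Unicode code points, into addak plus a single consonant. It assumes that only identical-consonant clusters mark gemination and does not attempt other cluster types.

```python
ADDAK = "\u0A71"   # GURMUKHI ADDAK
VIRAMA = "\u0A4D"  # GURMUKHI SIGN VIRAMA


def is_consonant(ch: str) -> bool:
    """Rough test: Gurmukhi consonant letters lie in U+0A15..U+0A39."""
    return "\u0A15" <= ch <= "\u0A39"


def geminate_to_addak(text: str) -> str:
    """Rewrite <C, virama, C> as <addak, C> when both consonants are the same."""
    out = []
    i = 0
    while i < len(text):
        if (
            i + 2 < len(text)
            and is_consonant(text[i])
            and text[i + 1] == VIRAMA
            and text[i + 2] == text[i]
        ):
            out.append(ADDAK + text[i])
            i += 3  # skip the consonant, the virama, and the repeated consonant
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

Applied to <pa, ga, virama, ga>, the function returns <pa, addak, ga>, the Unicode spelling of pagg given above.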
U+0A75 ੵ gurmukhi sign yakash probably originated as a subjoined form of U+0A2F ਯ
gurmukhi letter ya. However, because its usage is relatively rare and not entirely
predictable, it is encoded as a separate character. Some modern fonts render yakash with a
glyph that varies from the traditional shape found in the code charts. This character
should occur after the consonant to which it attaches and before any vowel sign.
U+0A51 ੑ gurmukhi sign udaat occurs in older texts and indicates a high tone. This
character should occur after the consonant to which it attaches and before any vowel sign.
Punjabi does not have complex combinations of consonant sounds. Furthermore, the
orthography is not strictly phonetic, and sometimes the inherent /a/ sound is not pronounced.
For example, the word ਗੁਰਮੁਖੀ gurmukhī is represented with the sequence
<0A17, 0A41, 0A30, 0A2E, 0A41, 0A16, 0A40>, which could be transliterated as guramukhī;
this lack of pronunciation is systematic at the end of a word. As a result, the virama sign is
seldom used with the Gurmukhi script.
In older texts, such as the Sri Guru Granth Sahib (the Sikh holy book), one can find typo-
graphic clusters with a vowel sign attached to a vowel letter, or with two vowel signs
attached to a consonant. The most common cases are ੁ u attached to ਓ, as in ਓੁ, and
both the vowel signs ੋ and ੁ attached to a consonant, as in ਗੋੁਬਿੰਦ goubinda; this is used to
indicate the metrical shortening of /o/ or the lengthening of /u/ depending on the context.
Other combinations are attested as well, such as ਗ੍ਹਿਾਨ ghiana, represented by the sequence
<U+0A17, U+0A4D, U+0A39, U+0A3F, U+0A3E, U+0A28>.
Because of the combining classes of the characters U+0A4B gurmukhi vowel sign oo
and U+0A41 gurmukhi vowel sign u, the sequences <consonant, U+0A4B, U+0A41>
and <consonant, U+0A41, U+0A4B> are not canonically equivalent. To avoid ambiguity in
representation, the first sequence, with U+0A4B before U+0A41, should be used in such
cases. More generally, when a consonant or independent vowel is modified by multiple
vowel signs, the sequence of the vowel signs in the underlying representation of the text
should be: left, top, bottom, right.
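The lack of canonical equivalence can be confirmed with normalization. In the Python sketch below, both vowel signs have canonical combining class 0, so normalization never reorders them and the two spellings remain distinct strings. The function names are hypothetical.

```python
import unicodedata

VOWEL_SIGN_OO = "\u0A4B"  # GURMUKHI VOWEL SIGN OO
VOWEL_SIGN_U = "\u0A41"   # GURMUKHI VOWEL SIGN U


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def preferred_oo_u(consonant: str) -> str:
    """Recommended order: U+0A4B before U+0A41."""
    return consonant + VOWEL_SIGN_OO + VOWEL_SIGN_U


if __name__ == "__main__":
    ga = "\u0A17"
    other = ga + VOWEL_SIGN_U + VOWEL_SIGN_OO
    print(canonically_equivalent(preferred_oo_u(ga), other))
```

Because normalization cannot repair the order, producers of text need to emit the recommended sequence themselves.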
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 12-16 shows the letters that can be ana-
lyzed, the single code point that should be used to represent them in text, and the sequence
of code points resulting from analysis that should not be used.
Tones. The Punjabi language is tonal, but the Gurmukhi script does not contain any spe-
cific signs to indicate tones. Instead, the voiced aspirates (gha, jha, ddha, dha) and the let-
ter ha combine consonantal and tonal functions.
Ordering. U+0A73 gurmukhi ura and U+0A72 gurmukhi iri are the first and third “let-
ters” of the Gurmukhi syllabary, respectively. They are used as bases or bearers for some of
the independent vowels, while U+0A05 gurmukhi letter a is both the second “letter”
and the base for the remaining independent vowels. As a result, the collation order for Gur-
mukhi is based on a seven-by-five grid:
• The first row is U+0A73 ura, U+0A05 a, U+0A72 iri, U+0A38 sa, U+0A39 ha.
• This row is followed by five main rows of consonants, grouped according to the
point of articulation, as is traditional in all South and Southeast Asian scripts.
• The semiconsonants follow in the seventh row: U+0A2F ya, U+0A30 ra,
U+0A32 la, U+0A35 va, U+0A5C rra.
• The letters with nukta, added later, are presented in a subsequent eighth row if
needed.
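The grid described in the list above can be turned into a simple letter-level sort key. In this Python sketch the first and seventh rows follow the text; the five consonant rows are filled in from the order of the Gurmukhi code chart, which is an assumption of the sketch. The names GRID, RANK, and sort_key are hypothetical, and letters outside the grid, such as those with nukta, simply sort last.

```python
GRID = [
    "\u0A73\u0A05\u0A72\u0A38\u0A39",  # row 1: ura, a, iri, sa, ha
    "\u0A15\u0A16\u0A17\u0A18\u0A19",  # row 2: ka, kha, ga, gha, nga
    "\u0A1A\u0A1B\u0A1C\u0A1D\u0A1E",  # row 3: ca, cha, ja, jha, nya
    "\u0A1F\u0A20\u0A21\u0A22\u0A23",  # row 4: tta, ttha, dda, ddha, nna
    "\u0A24\u0A25\u0A26\u0A27\u0A28",  # row 5: ta, tha, da, dha, na
    "\u0A2A\u0A2B\u0A2C\u0A2D\u0A2E",  # row 6: pa, pha, ba, bha, ma
    "\u0A2F\u0A30\u0A32\u0A35\u0A5C",  # row 7: ya, ra, la, va, rra
]

# Position of each letter when the grid is read row by row.
RANK = {letter: index for index, letter in enumerate("".join(GRID))}


def sort_key(letter: str) -> int:
    """Grid position of a letter; unknown letters sort after the grid."""
    return RANK.get(letter, len(RANK))
```

Sorting the letters ka, ha, ura, and rra with this key yields ura, ha, ka, rra. A full collation would also have to handle vowel signs, bearers, and the nukta letters.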
Rendering Behavior. For general principles regarding the rendering of the Gurmukhi
script, see the rules for rendering in Section 12.1, Devanagari. In many aspects, Gurmukhi
is simpler than Devanagari. In modern Punjabi, there are no half-consonants, no half-
forms, no repha (upper form of U+0930 devanagari letter ra), and no real ligatures.
Rules R2–R5, R11, and R14 do not apply. Conversely, the behavior for subscript RA (rules
R6–R8 and R13) applies to U+0A39 gurmukhi letter ha and U+0A35 gurmukhi let-
ter va, which also have subjoined forms, called pairin in Punjabi. The subjoined form for
RA is like a knot, while the subjoined HA and VA are written the same as the base form,
without the top bar, but are reduced in size. As described in rule R13, they attach at the bot-
tom of the base consonant, and will “push” down any attached vowel sign for U or UU.
When U+0A2F gurmukhi letter ya follows a dead consonant, it assumes a different
form called addha in Punjabi, without the leftmost part, and the dead consonant returns to
the nominal form, as shown in Table 12-17.
ਮ + ◌੍ + ਹ → ਮ੍ਹ (mha) pairin ha
ਪ + ◌੍ + ਰ → ਪ੍ਰ (pra) pairin ra
ਦ + ◌੍ + ਵ → ਦ੍ਵ (dva) pairin va
ਦ + ◌੍ + ਯ → ਦ੍ਯ (dya) addha ya
ਸ + ◌੍ + ਗ → ਸ੍ਗ (sga) pairin ga
ਸ + ◌੍ + ਚ → ਸ੍ਚ (sca) pairin ca
ਸ + ◌੍ + ਤ → ਸ੍ਤ (sta) pairin ta
ਸ + ◌੍ + ਦ → ਸ੍ਦ (sda) pairin da
ਸ + ◌੍ + ਨ → ਸ੍ਨ (sna) pairin na
ਸ + ◌੍ + ਯ → ਸ੍ਯ (sya) pairin ya
ਸ + ◌੍ + ਮ → ਸ੍ਮ (sma) addha ma
Older texts also exhibit another feature that is not found in modern Gurmukhi—namely,
the use of a half- or reduced form for the first consonant of a cluster, whereas the modern
practice is to represent the second consonant in a half- or reduced form. Joiners can be
used to request this older rendering, as shown in Table 12-19. The reduced form of an ini-
tial U+0A30 gurmukhi letter ra is similar to the Devanagari superscript RA (repha), but
this usage is rare, even in older texts.
ਸ + ◌੍ + ਵ → ਸ੍ਵ (sva)
ਰ + ◌੍ + ਵ → ਰ੍ਵ (rva)
ਸ + ◌੍ + ZWJ + ਵ → ਸ੍‍ਵ (sva)
ਰ + ◌੍ + ZWJ + ਵ → ਰ੍‍ਵ (rva)
ਸ + ◌੍ + ZWNJ + ਵ → ਸ੍‌ਵ (sva)
ਰ + ◌੍ + ZWNJ + ਵ → ਰ੍‌ਵ (rva)
A rendering engine for Gurmukhi should make accommodations for the correct position-
ing of the combining marks (see Section 5.13, Rendering Nonspacing Marks, and particu-
larly Figure 5-11). This is important, for example, in the correct centering of the marks
above and below U+0A28 gurmukhi letter na and U+0A20 gurmukhi letter ttha,
which are laterally symmetrical. It is also important to avoid collisions between the various
upper marks, vowel signs, bindi, and/or addak.
Other Symbols. The religious symbol khanda sometimes used in Gurmukhi texts is
encoded at U+262C adi shakti in the Miscellaneous Symbols block. U+0A74 gurmukhi
ek onkar, which is also a religious symbol, can have different presentation forms, which
do not change its meaning. The font used in the code charts shows a highly stylized form;
simpler forms look like the digit one, followed by a sign based on ura, along with a long
upper tail.
Punctuation. Danda and double danda marks as well as some other unified punctuation
used with Gurmukhi are found in the Devanagari block. See Section 12.1, Devanagari, for
more information. Punjabi also uses Latin punctuation.
12.4 Gujarati
Gujarati: U+0A80–U+0AFF
The Gujarati script is a North Indian script closely related to Devanagari. It is most obvi-
ously distinguished from Devanagari by not having a horizontal bar for its letterforms, a
characteristic of the older Kaithi script to which Gujarati is related. The Gujarati script is
used to write the Gujarati language of the Gujarat state in India.
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 12-20 shows the letters that can be ana-
lyzed, the single code point that should be used to represent them in text, and the sequence
of code points resulting from analysis that should not be used.
Rendering Behavior. For rendering of the Gujarati script, see the rules for rendering in
Section 12.1, Devanagari. Like other Brahmic scripts in the Unicode Standard, Gujarati
uses the virama to form conjunct characters. The virama is informally called khoḍo, which
means “lame” in Gujarati. Many conjunct characters, as in Devanagari, lose the vertical
stroke; there are also vertical conjuncts. U+0AB0 gujarati letter ra takes special forms
when it combines with other consonants, as shown in Table 12-21.
ક (U+0A95) + ્ (U+0ACD) + ષ (U+0AB7) → ક્ષ (kṣa)
જ (U+0A9C) + ્ (U+0ACD) + ઞ (U+0A9E) → જ્ઞ (jña)
ત (U+0AA4) + ્ (U+0ACD) + ય (U+0AAF) → ત્ય (tya)
ટ (U+0A9F) + ્ (U+0ACD) + ટ (U+0A9F) → ટ્ટ (ṭṭa)
ર (U+0AB0) + ્ (U+0ACD) + ક (U+0A95) → ર્ક (rka)
ક (U+0A95) + ્ (U+0ACD) + ર (U+0AB0) → ક્ર (kra)
Marks for Transliteration of Arabic. The combining marks encoded in the range
U+0AFA..U+0AFF are used for the transliteration of the Arabic script into Gujarati. This
system of transliteration was devised in the late 19th century, and is used by Ismaili Khoja
communities. These marks occur both in manuscripts and in printed materials.
The three forms of nukta encoded in the range U+0AFD..U+0AFF are diacritics, placed
above regular Gujarati letters to create new letters corresponding to Arabic letters for
non-Gujarati sounds. U+0AFF gujarati sign two-circle nukta above is used only with
U+0A9D gujarati letter jha, to transliterate the Arabic zah. U+0AFE gujarati sign
circle nukta above is used with U+0A9D gujarati letter jha to transliterate the Arabic
thal and with U+0AB8 gujarati letter sa to transliterate the Arabic theh. U+0AFD
gujarati sign three-dot nukta above occurs with a number of different Gujarati letters,
to transliterate a variety of Arabic letters.
U+0AFA gujarati sign sukun, U+0AFB gujarati sign shadda, and U+0AFC gujarati
sign maddah are used to transliterate the Arabic sukun, shadda, and maddah above,
respectively. These marks may be applied to a Gujarati letter which also uses one of the
three above-base nukta diacritic marks. In such cases, the nukta occurs first in the combin-
ing sequence, followed by the sukun, shadda, or maddah mark. However, instead of being
rendered above the nukta mark on the letter, the sukun, shadda, or maddah mark is ren-
dered to the left of the nukta mark.
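The ordering rule above (nukta first, then sukun, shadda, or maddah) can be sketched in Python. This is an illustration only: the helper name `order_gujarati_marks` is hypothetical, and it assumes the caller supplies the combining marks that belong to a single base letter. It affects only the encoded order; the placement of the mark to the left of the nukta is a matter for the font and rendering engine.

```python
# Code points as listed in the text above.
NUKTAS_ABOVE = {"\u0AFD", "\u0AFE", "\u0AFF"}  # three-dot, circle, two-circle nukta above

def order_gujarati_marks(base, marks):
    """Return base + marks with any above-base nukta moved to the front.

    Hypothetical helper: sorted() is stable, so all other marks keep
    their original relative order after the nukta.
    """
    ordered = sorted(marks, key=lambda m: 0 if m in NUKTAS_ABOVE else 1)
    return base + "".join(ordered)

# jha + sukun + circle nukta, supplied in the wrong order
sequence = order_gujarati_marks("\u0A9D", ["\u0AFA", "\u0AFE"])
```

Here `sequence` holds jha, circle nukta, sukun, in that order.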
Punctuation. Words in Gujarati are separated by spaces. Danda and double danda marks
as well as some other unified punctuation used with Gujarati are found in the Devanagari
block; see Section 12.1, Devanagari.
12.5 Oriya (Odia)
Rendering Behavior. For rendering of the Oriya script, see the rules for rendering in
Section 12.1, Devanagari. Like other Brahmic scripts in the Unicode Standard, Oriya uses
the virama to suppress the inherent vowel. Oriya has a visible virama, often being a length-
ening of a part of the base consonant:
ତ (U+0B24) + ୍ (U+0B4D) + ଯ (U+0B2F) → ତ୍ଯ (tya)
Consonant Forms. In the initial position in a cluster, RA is reduced and placed above the
following consonant, while it is also reduced in the second position:
ର (U+0B30) + ୍ (U+0B4D) + ପ (U+0B2A) → ର୍ପ (rpa)
ପ (U+0B2A) + ୍ (U+0B4D) + ର (U+0B30) → ପ୍ର (pra)
Nasal and stop clusters may be written with conjuncts, or the anusvara may be used:
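Vowel signs combined with U+0B15 oriya letter ka:
କ + ା (U+0B3E) → କା (kā)
କ + ି (U+0B3F) → କି (ki)
କ + ୀ (U+0B40) → କୀ (kī)
କ + ୁ (U+0B41) → କୁ (ku)
କ + ୂ (U+0B42) → କୂ (kū)
କ + ୃ (U+0B43) → କୃ (kṛ)
କ + େ (U+0B47) → କେ (ke)
କ + ୈ (U+0B48) → କୈ (kai)
କ + ୋ (U+0B4B) → କୋ (ko)
କ + ୌ (U+0B4C) → କୌ (kau)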
Oriya VA and WA. These two letters are extensions to the basic Oriya alphabet. Because
Sanskrit वन vana becomes Oriya ବନ bana in orthography and pronunciation, an
extended letter U+0B35 ଵ oriya letter va was devised by dotting U+0B2C ବ oriya letter
ba for use in academic and technical text. For example, basic Oriya script cannot
distinguish Sanskrit बव bava from बब baba or वव vava, but this distinction can be made
with the modified version of ba. Older sources show other glyphs for va, some of which
would take a different shape in a more modern type style.
The letter va is not in common use today.
In a consonant conjunct, subjoined U+0B2C ବ oriya letter ba is usually—but not
always—pronounced [wa]:
U+0B15 କ ka + U+0B4D ୍ virama + U+0B2C ବ ba → କ୍ବ [kwa]
U+0B2E ମ ma + U+0B4D ୍ virama + U+0B2C ବ ba → ମ୍ବ [mba]
The extended Oriya letter U+0B71 ୱ oriya letter wa is sometimes used in Perso-Arabic
or English loan words for [w]. It appears to have originally been devised as a ligature of ଓ o
and ବ ba, but because ligatures of independent vowels and consonants are not normally
used in Oriya, this letter has been encoded as a single character that does not have a
decomposition. It is used initially in words or orthographic syllables to represent the
foreign consonant; as a native semivowel, virama + ba is used because that is historically
accurate. Several glyph variants of wa exist.
Punctuation and Symbols. Danda and double danda marks as well as some other unified
punctuation used with Oriya are found in the Devanagari block; see Section 12.1, Devana-
gari. The mark U+0B70 oriya isshar is placed before names of persons who are deceased.
The sacred syllable om is formed by U+0B13 oriya letter o and U+0B01 oriya sign
candrabindu. Ligation of the two glyphs can be encouraged or discouraged by the use of
U+200D zero width joiner or U+200C zero width non-joiner between the two char-
acters, as seen in Table 12-25. In the absence of a joiner, both the non-ligated and the
ligated forms are acceptable renderings.
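ଓ (U+0B13) + ଁ (U+0B01) → ligated or non-ligated form
ଓ (U+0B13) + ZWJ (U+200D) + ଁ (U+0B01) → ligated form
ଓ (U+0B13) + ZWNJ (U+200C) + ଁ (U+0B01) → non-ligated form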
Fraction Characters. As for many other scripts of India, Oriya has characters used to
denote fractional values. These were more commonly used before the advent of decimal
weights, measures, and currencies. Oriya uses six signs: three for quarter values (1/4, 1/2,
3/4) and three for sixteenth values (1/16, 1/8, and 3/16). These are used additively, with
quarter values appearing before sixteenths. Thus U+0B73 oriya fraction one half followed
by U+0B75 oriya fraction one sixteenth represents the value 9/16.
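The additive reading of these signs can be sketched in Python. The table below takes its values from the character names of U+0B72..U+0B77; the function name `oriya_fraction_value` is hypothetical, and the sketch does not check that quarter values precede sixteenths.

```python
from fractions import Fraction

# Values follow the character names in the Oriya block.
ORIYA_FRACTIONS = {
    "\u0B72": Fraction(1, 4),   # ORIYA FRACTION ONE QUARTER
    "\u0B73": Fraction(1, 2),   # ORIYA FRACTION ONE HALF
    "\u0B74": Fraction(3, 4),   # ORIYA FRACTION THREE QUARTERS
    "\u0B75": Fraction(1, 16),  # ORIYA FRACTION ONE SIXTEENTH
    "\u0B76": Fraction(1, 8),   # ORIYA FRACTION ONE EIGHTH
    "\u0B77": Fraction(3, 16),  # ORIYA FRACTION THREE SIXTEENTHS
}

def oriya_fraction_value(text):
    """Sum the values of a run of Oriya fraction signs (additive notation)."""
    return sum((ORIYA_FRACTIONS[ch] for ch in text), Fraction(0))

half_plus_sixteenth = oriya_fraction_value("\u0B73\u0B75")
```

One half followed by one sixteenth sums to 8/16 + 1/16 = 9/16.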
12.6 Tamil
Tamil: U+0B80–U+0BFF
The Tamil script is descended from the South Indian branch of Brahmi. It is used to write
the Tamil language of the Tamil Nadu state in India as well as minority languages such as
Irula, the Dravidian language Badaga, and the Indo-European language Saurashtra. Tamil
is also used in Sri Lanka, Singapore, and parts of Malaysia.
The Tamil script has fewer consonants than the other Indic scripts. When representing the
“missing” consonants in transcriptions of languages such as Sanskrit or Saurashtra,
superscript European digits are often used, so ப² = pha, ப³ = ba, and ப⁴ = bha. The characters
U+00B2, U+00B3, and U+2074 can be used to preserve this distinction in plain text. The
Grantha script is often also used by Tamil speakers to write Sanskrit because Grantha
contains these missing consonants.
The Tamil script also avoids the use of conjunct consonant forms, although a few conven-
tional conjuncts are used.
Virama (Puḷḷi). Because the Tamil encoding in the Unicode Standard is based on ISCII-
1988 (Indian Script Code for Information Interchange), it makes use of the abugida model.
An abugida treats the basic consonants as containing an inherent vowel, which can be
canceled by the use of a visible mark, called a virama in Sanskrit. In most Brahmi-derived
scripts, the placement of a virama between two consonants implies the deletion of the
inherent vowel of the first consonant and causes a conjoined or subjoined consonant
cluster. In those scripts, zero width non-joiner is used to display a visible virama, as shown
previously in the Devanagari example in Figure 12-4.
The situation is quite different for Tamil because the script uses very few consonant
conjuncts. An orthographic cluster consisting of multiple consonants (represented by <C1,
U+0BCD tamil sign virama, C2, ...>) is normally displayed with explicit viramas, which
are called puḷḷi in Tamil. The puḷḷi is typically rendered as a dot centered above the
character. It occasionally appears as a small circle instead of a dot, but this glyph variant
should be handled by the font, and not be represented by the similar-appearing U+0B82
tamil sign anusvara.
The conjuncts kssa and shrii are traditionally displayed by conjunct ligatures, as illustrated
for kssa in Figure 12-15, but nowadays tend to be displayed using an explicit puḷḷi as well.
க (U+0B95) + ் (U+0BCD) + ஷ (U+0BB7) → க்ஷ kṣa
To explicitly display a puḷḷi for such sequences, zero width non-joiner can be inserted
after the puḷḷi in the sequence of characters.
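As a small illustration of the sequence just described, the following Python sketch builds a two-consonant cluster with U+200C inserted after U+0BCD. The helper name `with_explicit_pulli` is hypothetical; whether the ligature or the explicit mark is actually displayed still depends on the font and rendering engine.

```python
PULLI = "\u0BCD"  # U+0BCD TAMIL SIGN VIRAMA
ZWNJ = "\u200C"   # U+200C ZERO WIDTH NON-JOINER

def with_explicit_pulli(c1, c2):
    """Return <C1, virama, ZWNJ, C2>, requesting a visible mark on C1."""
    return c1 + PULLI + ZWNJ + c2

KA, SSA = "\u0B95", "\u0BB7"
default_cluster = KA + PULLI + SSA              # may render as the kssa ligature
explicit_cluster = with_explicit_pulli(KA, SSA)  # asks for the explicit form
```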
Rendering of the Tamil Script. The Tamil script is complex and requires special rules for
rendering. The following discussion describes the most important features of Tamil ren-
dering behavior. As with any script, a more complex procedure can add rendering charac-
teristics, depending on the font and application.
In a font that is capable of rendering Tamil, the number of glyphs is greater
than the number of Tamil characters.
Tamil Vowels
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 12-26 shows the letters that can be ana-
lyzed, the single code point that should be used to represent them in text, and the sequence
of code points resulting from analysis that should not be used.
Independent Versus Dependent Vowels. In the Tamil script, the dependent vowel signs are
not equivalent to a sequence of virama + independent vowel. For example:
ன + ி ≠ ன + ் + இ
Left-Side Vowels. The Tamil vowels U+0BC6 ெ, U+0BC7 ே, and U+0BC8 ை are
reordered in front of the consonant to which they are applied. When occurring in a
syllable, these vowels are rendered to the left side of their consonant, as shown in Figure 12-16.
In these examples, the representation on the left, which is a single code point, is the pre-
ferred form and the form in common use for Tamil.
In the process of rendering, these two-part vowels are transformed into the two separate
glyphs equivalent to those on the right, which are then subject to vowel reordering, as
shown in Figure 12-18.
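The two steps described here, splitting a two-part vowel and then moving the left-side part in front of the consonant, can be sketched with Python's standard `unicodedata` module. The split relies on the canonical decompositions of U+0BCA, U+0BCB, and U+0BCC. The helper `visual_order` is hypothetical and handles only a single consonant followed by one vowel sign; real reordering is done by the rendering engine, not by rewriting the stored text.

```python
import unicodedata

LEFT_SIDE_VOWELS = {"\u0BC6", "\u0BC7", "\u0BC8"}  # vowel signs e, ee, ai

def visual_order(syllable):
    """Return the code points of a consonant + vowel-sign syllable in glyph order.

    Hypothetical helper: assumes the first code point is the consonant.
    """
    # NFD splits two-part vowels, e.g. U+0BCA -> U+0BC6 + U+0BBE.
    decomposed = unicodedata.normalize("NFD", syllable)
    consonant, marks = decomposed[0], decomposed[1:]
    left = [m for m in marks if m in LEFT_SIDE_VOWELS]
    right = [m for m in marks if m not in LEFT_SIDE_VOWELS]
    return "".join(left) + consonant + "".join(right)

ko = visual_order("\u0B95\u0BCA")   # ka + vowel sign o
kai = visual_order("\u0B95\u0BC8")  # ka + vowel sign ai
```

The stored text keeps the single-code-point form; only the glyph sequence handed to the font is reordered.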
Tamil Ligatures
A number of ligatures are conventionally used in Tamil. Most ligatures involve the shape
taken by a consonant plus vowel sequence. A wide variety of modern Tamil words are
written without a conjunct form, with a fully visible puḷḷi.
Ligatures with Vowel i. The vowel signs i ி and ii ீ form ligatures with the consonant
tta ட as shown in examples 1 and 2 of Figure 12-21. These vowels often change shape or
position slightly so as to join cursively with other consonants, as shown in examples 3 and
4 of Figure 12-21.
1 ட + ி → டி ṭi
2 ட + ீ → டீ ṭī
3 ல + ி → லி li
4 ல + ீ → லீ lī
Ligatures with Vowel u. The vowel signs u ு and uu ூ normally ligate with their
consonant, as shown in Table 12-27. In the first column, the basic consonant is shown; the
second column illustrates the ligation of that consonant with the u vowel sign; and the third
column illustrates the ligation with the uu vowel sign.
ஜ + ு → ஜு ju
ஜ + ூ → ஜூ jū
Ligatures with ra. Based on typographical preferences, the consonant ra ர may change
shape to ா, when it ligates. Such change, if it occurs, will happen only when the ா form of
U+0BB0 ர tamil letter ra would not be confused with the nominal form ா of U+0BBE
tamil vowel sign aa (namely, when ர is combined with ், ி, or ீ). This change in
shape is illustrated in Figure 12-23.
ர + ் → ர் r
ர + ி → ரி ri
ர + ீ → ரீ rī
However, various governmental bodies mandate that the basic shape of the consonant ra ர
should be used for these ligatures as well, especially in school textbooks. Media and literary
publications in Malaysia and Singapore mostly use the unchanged form of ra ர. Sri Lanka,
on the other hand, specifies the use of the changed forms shown in Figure 12-24.
Tamil Ligature shri. Prior to Unicode 4.1, the best mapping to represent the ligature shri
was to the sequence <U+0BB8, U+0BCD, U+0BB0, U+0BC0>. Unicode 4.1 in 2005 added
the character U+0BB6 tamil letter sha and as a consequence, the best mapping became
<U+0BB6, U+0BCD, U+0BB0, U+0BC0>.
ஶ (U+0BB6) + ் (U+0BCD) + ர (U+0BB0) + ீ (U+0BC0) → ஶ்ரீ
ஸ (U+0BB8) + ் (U+0BCD) + ர (U+0BB0) + ீ (U+0BC0) → ஸ்ரீ
Ligatures with aa in Traditional Tamil Orthography. In traditional Tamil orthography,
the vowel sign aa ா optionally ligates with ண, ன, or ற, as illustrated in Figure 12-25.
ண + ா → ணா ṇā
ன + ா → னா ṉā
ற + ா → றா ṟā
These ligations also affect the right-hand part of two-part vowels, as shown in Figure 12-26.
ண + ொ → ணொ ṇo
ண + ோ → ணோ ṇō
ன + ொ → னொ ṉo
ன + ோ → னோ ṉō
ற + ொ → றொ ṟo
ற + ோ → றோ ṟō
ண + ை → ணை ṇai
ன + ை → னை ṉai
ல + ை → லை lai
ள + ை → ளை ḷai
By contrast, in modern Tamil orthography, this vowel does not change its shape, as shown
in Figure 12-28.
ண + ை → ணை ṇai
Tamil aytham. The character U+0B83 tamil sign visarga is normally called aytham in
Tamil. It is historically related to the visarga in other Indic scripts, but has become an ordi-
nary spacing letter in Tamil. The aytham occurs in native Tamil words, but is frequently
used as a modifying prefix before consonants used to represent foreign sounds. In particu-
lar, it is used in the spelling of words borrowed into Tamil from English or other languages.
Punctuation. Danda and double danda marks as well as some other unified punctuation
used with Tamil are found in the Devanagari block; see Section 12.1, Devanagari.
Numbers. Modern Tamil decimal digits are encoded at U+0BE6..U+0BEF. Note that some
digits are confusable with letters, as shown in Table 12-28. In some Tamil fonts, the digits
for two and eight look exactly like the letters u and a, respectively. In other fonts, as shown
here, the shapes for the digits two and eight are adjusted to minimize confusability.
Tamil also has distinct numerals for ten, one hundred, and one thousand at
U+0BF0..U+0BF2 used for historical numbers.
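The numeric properties of these characters can be inspected from Python. The sketch below relies on two standard behaviors: `int()` accepts any Unicode decimal digits, and `unicodedata.numeric` returns the Numeric_Value property. The function names are hypothetical.

```python
import unicodedata

def tamil_decimal_to_int(text):
    """Parse a run of Tamil decimal digits (U+0BE6..U+0BEF) as an integer."""
    return int(text)

def historical_numeral_value(ch):
    """Return the numeric value of a single Tamil numeral such as U+0BF0."""
    return int(unicodedata.numeric(ch))

n = tamil_decimal_to_int("\u0BE7\u0BE8\u0BE9")  # digits one, two, three
ten = historical_numeral_value("\u0BF0")
hundred = historical_numeral_value("\u0BF1")
thousand = historical_numeral_value("\u0BF2")
```

Because digits such as two and eight can be confused visually with letters, checking the General_Category or numeric value of a character is more reliable than comparing glyph shapes.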
Use of Nukta. In addition to Tamil, several other languages of southern India are written
using the Tamil script. For example, Irula is written with the Tamil script. Some of these
languages contain sounds distinct from those normally written for the Tamil language. In
such cases, the writing systems of these languages apply diacritic nukta marks to Tamil let-
ters to represent their distinct sounds. For example, Irula uses a double dot nukta below for
some sounds. That nukta can be represented with U+1133C grantha sign nukta.
12.7 Telugu
Telugu: U+0C00–U+0C7F
The Telugu script is a South Indian script used to write the Telugu language of the Andhra
Pradesh state in India as well as minority languages such as Gondi (Adilabad and Koi dia-
lects) and Lambadi. The script is also used in Maharashtra, Odisha (Orissa), Madhya
Pradesh, and West Bengal. The Telugu script became distinct by the thirteenth century CE
and shares ancestors with the Kannada script.
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 12-30 shows the letters that can be ana-
lyzed, the single code point that should be used to represent them in text, and the sequence
of code points resulting from analysis that should not be used.
Rendering Behavior. Telugu script rendering is similar to that of other Brahmic scripts in
the Unicode Standard—in particular, the Tamil script. Unlike Tamil, however, the Telugu
script writes conjunct characters with subscript letters. Many Telugu letters have a v-
shaped headstroke, which is a structural mark corresponding to the horizontal bar in
Devanagari and the arch in Oriya script. When a virama (called virāmamu in Telugu) or
certain vowel signs are added to a letter with this headstroke, it is replaced:
U+0C15 క ka + U+0C4D ్ virama + U+200C zero width non-joiner → క్ (k)
U+0C15 క ka + U+0C3F ి vowel sign i → కి (ki)
Telugu consonant clusters are most commonly represented by a subscripted, and often
transformed, consonant glyph for the second element of the cluster:
U+0C17 గ ga + U+0C4D ్ virama + U+0C17 గ ga → గ్గ (gga)
U+0C15 క ka + U+0C4D ్ virama + U+0C15 క ka → క్క (kka)
U+0C15 క ka + U+0C4D ్ virama + U+0C2F య ya → క్య (kya)
U+0C15 క ka + U+0C4D ్ virama + U+0C37 ష ssa → క్ష (kṣa)
Nakāra-Pollu. The sequence <U+0C28 telugu letter na, U+0C4D telugu sign
virama> can have two representations in Telugu text. The first is the “regular” or “new
style” form న్, which takes its shape from the glyphs in the sequence <U+0C28 న telugu
letter na, U+0C4D ్ telugu sign virama>. Older texts display the other vowel-less
form, called nakāra-pollu. The two forms are semantically identical. Fonts should render
the sequence <U+0C28 telugu letter na, U+0C4D telugu sign virama> with
either the old-style glyph or the new-style glyph. The character U+200C zero width
non-joiner can be used to prevent interaction of this sequence with following consonants,
as shown in Table 12-31.
Reph. In modern Telugu, U+0C30 telugu letter ra behaves in the same manner as most
other initial consonants in a consonant cluster. That is, the ra appears in its nominal form,
and the second consonant takes the C2-conjoining or subscripted form:
U+0C30 ర ra + U+0C4D ్ virama + U+0C2E మ ma → ర్మ (rma)
However, in older texts, U+0C30 telugu letter ra takes the reduced (or reph) form
when it appears first in a consonant cluster, and the following consonant maintains its
nominal form:
U+0C30 ర ra + U+0C4D ్ virama + U+0C2E మ ma → మ with reph (rma)
U+200D zero width joiner is placed immediately after the virama to render the reph
explicitly in modern texts:
U+0C30 ర ra + U+0C4D ్ virama + U+200D ZWJ + U+0C2E మ ma → మ with reph (rma)
To prevent display of a reph, U+200D zero width joiner is placed after the ra, but
preceding the virama:
U+0C30 ర ra + U+200D ZWJ + U+0C4D ్ virama + U+0C2E మ ma → ర్మ (rma)
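The three encodings of a ra-initial cluster described above differ only in whether and where U+200D appears. A minimal Python sketch follows; the helper name `ra_cluster` and its `reph` parameter are hypothetical and exist only for illustration.

```python
RA = "\u0C30"      # U+0C30 TELUGU LETTER RA
VIRAMA = "\u0C4D"  # U+0C4D TELUGU SIGN VIRAMA
ZWJ = "\u200D"     # U+200D ZERO WIDTH JOINER

def ra_cluster(second, reph=None):
    """Build a cluster of ra + virama + a second consonant.

    reph=None  -> <ra, virama, C2>: leave the choice to the font
    reph=True  -> <ra, virama, ZWJ, C2>: request the reph form
    reph=False -> <ra, ZWJ, virama, C2>: prevent the reph form
    """
    if reph is True:
        return RA + VIRAMA + ZWJ + second
    if reph is False:
        return RA + ZWJ + VIRAMA + second
    return RA + VIRAMA + second

MA = "\u0C2E"
rma_default = ra_cluster(MA)
rma_with_reph = ra_cluster(MA, reph=True)
rma_without_reph = ra_cluster(MA, reph=False)
```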
Special Characters. U+0C55 telugu length mark is provided as an encoding for the
second element of the vowel U+0C47 telugu vowel sign ee. U+0C56 telugu ai length
mark is provided as an encoding for the second element of the surroundrant vowel
U+0C48 telugu vowel sign ai. The length marks are both nonspacing characters. For a
detailed discussion of the use of two-part vowels, see “Two-Part Vowels” in Section 12.6,
Tamil.
Fractions. Prior to the adoption of the metric system, Telugu fractions were used as part of
the system of measurement. Telugu fractions are quaternary (base-4), and use eight marks,
which are conceptually divided into two sets. The first set represents odd-numbered nega-
tive powers of four in fractions. The second set represents even-numbered negative powers
of four in fractions. Different zeros are used with each set. The zero from the first set is
known as haḷḷi, U+0C78 telugu fraction digit zero for odd powers of four. The
zero for the second set is U+0C66 telugu digit zero.
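The two digit sets can be illustrated with a small Python sketch. The digit values come from the character names of U+0C78..U+0C7E. The positional reading used here is an assumption made for illustration, not something stated above: the first digit is taken as a multiple of 4⁻¹, the second of 4⁻², and so on, with each position drawing from the set that matches its parity. The function name `telugu_fraction_value` is hypothetical.

```python
from fractions import Fraction

# Digits for odd-numbered negative powers of four (4**-1, 4**-3, ...).
ODD_SET = {"\u0C78": 0, "\u0C79": 1, "\u0C7A": 2, "\u0C7B": 3}
# Digits for even-numbered negative powers of four (4**-2, 4**-4, ...).
EVEN_SET = {"\u0C66": 0, "\u0C7C": 1, "\u0C7D": 2, "\u0C7E": 3}

def telugu_fraction_value(digits):
    """Evaluate a run of Telugu fraction digits (assumed positional, base 4)."""
    total = Fraction(0)
    for position, ch in enumerate(digits, start=1):
        table = ODD_SET if position % 2 == 1 else EVEN_SET
        if ch not in table:
            raise ValueError(f"U+{ord(ch):04X} does not belong at position {position}")
        total += Fraction(table[ch], 4 ** position)
    return total

quarter = telugu_fraction_value("\u0C79")          # 1 x 4**-1
five_eighths = telugu_fraction_value("\u0C7A\u0C7D")  # 2/4 + 2/16
```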
Punctuation. Danda and double danda are used primarily in the domain of religious texts
to indicate the equivalent of a comma and full stop, respectively. The danda and double
danda marks as well as some other unified punctuation used with Telugu are found in the
Devanagari block; see Section 12.1, Devanagari.
12.8 Kannada
Kannada: U+0C80–U+0CFF
The Kannada script is a South Indian script. It is used to write the Kannada (or Kanarese)
language of the Karnataka state in India and to write minority languages such as Tulu. The
Kannada language is also used in many parts of Tamil Nadu, Kerala, Andhra Pradesh, and
Maharashtra. This script is very closely related to the Telugu script both in the shapes of the
letters and in the behavior of conjunct consonants. The Kannada script also shares many fea-
tures common to other Indic scripts. See Section 12.1, Devanagari, for further information.
The Unicode Standard follows the ISCII layout for encoding, which also reflects the tradi-
tional Kannada alphabetic order.
Consonant Conjuncts. Kannada is also noted for a large number of consonant conjunct
forms that serve as ligatures of two or more adjacent forms. This use of ligatures takes place
South and Central Asia-I 503 12.8 Kannada
Kannada script need to be aware that these sequences involving independent vowels fol-
lowed by virama and U+0CDE are valid and required in orthographies for Badaga. Exam-
ples of the use of subjoined U+0CDE to indicate retroflexion, both for independent vowel
letters and for dependent vowels, are shown in Figure 12-29.
ಉ (0C89) + ್ (0CCD) + ೞ (0CDE) → ಉ with subjoined ೞ
ಯ (0CAF) + ್ (0CCD) + ೞ (0CDE) + ೆ (0CC6) → ಯ with subjoined ೞ and vowel sign ೆ
Rendering Kannada
Plain text in Kannada is generally stored in phonetic order; that is, a CV syllable with a
dependent vowel is always encoded as a consonant letter C followed by a vowel sign V in
the memory representation. This order is employed by the ISCII standard and corresponds
to the phonetic and keying order of textual data. Unlike in Devanagari and some other
Indian scripts, all of the dependent vowels in Kannada are depicted to the right of their
consonant letters. Hence there is no need to reorder the elements in mapping from the log-
ical (character) store to the presentation (glyph) rendering, and vice versa.
Explicit Virama (Halant). Normally, a halant character creates dead consonants, which in
turn combine with subsequent consonants to form conjuncts. This behavior usually results
in a halant sign not being depicted visually. Occasionally, this default behavior is not
desired when a dead consonant should be excluded from conjunct formation, in which
case the halant sign is visibly rendered. To accomplish this, U+200C zero width non-
joiner is introduced immediately after the encoded dead consonant that is to be excluded
from conjunct formation. See Section 12.1, Devanagari, for examples.
Vowelless NA. The sequence <U+0CA8 kannada letter na, U+0CCD kannada sign
virama> can have two representations in Kannada text. The first is the “regular” or “new
style” form ನ್, which takes its shape from the glyphs in the sequence <U+0CA8 kannada
letter na, U+0CCD kannada sign virama>. Older texts display the other vowel-less
form. The two forms are semantically identical. Fonts should render the sequence
<U+0CA8 kannada letter na, U+0CCD kannada sign virama> with either the
old-style glyph or the new-style glyph. The character U+200C zero width non-joiner
can be used to prevent interaction of this sequence with the following consonants, as
shown in Table 12-33.
See the discussion of the analogous rendering of na in Telugu, called nakāra-pollu, in
Section 12.7, Telugu.
Consonant Clusters Involving RA. Whenever a consonant cluster is formed with the
U+0CB0 ರ kannada letter ra as the first component of the consonant cluster, the letter
ra is depicted with two different presentation forms: one as the initial element and the
other as the final display element of the consonant cluster.
U+0CB0 ರ ra + U+0CCD ್ halant + U+0C95 ಕ ka → ರ್ಕ rka (ra as the final display element)
U+0CB0 ರ ra + U+200D ZWJ + U+0CCD ್ halant + U+0C95 ಕ ka → rka (ra as the initial element)
U+0C95 ಕ ka + U+0CCD ್ halant + U+0CB0 ರ ra → ಕ್ರ kra
Jihvamuliya and Upadhmaniya. Voiceless velar and bilabial fricatives in Kannada are rep-
resented by U+0CF1 kannada sign jihvamuliya and U+0CF2 kannada sign upadh-
maniya, respectively. When the signs appear with a following homorganic voiceless stop
consonant, the combination should be rendered in the font as a stacked ligature, without a
virama:
U+0CF1 ೱ jihvamuliya + U+0C95 ಕ ka → ೱಕ
U+0CF2 ೲ upadhmaniya + U+0CAB ಫ pha → ೲಫ
Modifier Mark Rules. In addition to the vowel signs, one or more types of combining
marks may be applied to a component of a written syllable or the syllable as a whole. If the
consonant represents a dead consonant, then the nukta should precede the halant in the
memory representation. The nukta is represented by a double-dot mark, U+0CBC ಼ kannada
sign nukta. Two such modified consonants are used in the Kannada language: one
representing the syllable za and one representing the syllable fa.
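The required order of nukta before halant matches the canonical ordering of these marks: U+0CBC has canonical combining class 7 and U+0CCD has class 9, so Unicode normalization places the nukta first. The following Python sketch uses the standard `unicodedata` module to show this; the variable names are illustrative only.

```python
import unicodedata

JA = "\u0C9C"      # U+0C9C KANNADA LETTER JA
NUKTA = "\u0CBC"   # U+0CBC KANNADA SIGN NUKTA
HALANT = "\u0CCD"  # U+0CCD KANNADA SIGN VIRAMA

# A dead consonant typed with the halant before the nukta.
misordered = JA + HALANT + NUKTA

# Canonical reordering sorts adjacent marks by combining class (7 before 9).
normalized = unicodedata.normalize("NFC", misordered)

nukta_class = unicodedata.combining(NUKTA)
halant_class = unicodedata.combining(HALANT)
```

After normalization the sequence is ja, nukta, halant, which is the order the text calls for.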
Avagraha Sign. A spacing mark, U+0CBD ಽ kannada sign avagraha, is used when
rendering Sanskrit texts.
Punctuation. Danda and double danda marks as well as some other unified punctuation
used with this script are found in the Devanagari block; see Section 12.1, Devanagari.
12.9 Malayalam
Malayalam: U+0D00–U+0D7F
The Malayalam script is a South Indian script used to write the Malayalam language of the
Kerala state. Malayalam is a Dravidian language like Kannada, Tamil, and Telugu.
Throughout its history, it has absorbed words from Tamil, Sanskrit, Arabic, and English.
The shapes of Malayalam letters closely resemble those of Tamil. Malayalam, however, has
a very full and complex set of conjunct consonant forms.
Vowel Letters. Vowel letters are encoded atomically in Unicode, even if they can be ana-
lyzed visually as consisting of multiple parts. Table 12-34 shows the letters that can be ana-
lyzed, the single code point that should be used to represent them in text, and the sequence
of code points resulting from analysis that should not be used.
Two-Part Vowels. The Malayalam script uses several two-part vowel characters. In modern
times, the dominant practice is to write the dependent form of the au vowel using only “ൗ”,
which is placed on the right side of the consonant it modifies; such texts are represented in
Unicode using U+0D57 malayalam au length mark. In the past, this dependent form
was written using both “െ” on the left side and “ൗ” on the right side; U+0D4C malayalam
vowel sign au can be used for documents following this earlier tradition. This historical
simplification started much earlier than the orthographic reforms mentioned in the text
that follows.
For a detailed discussion of the use of two-part vowels, see “Two-Part Vowels” in
Section 12.6, Tamil.
Historic Characters. The four characters, avagraha, vocalic rr sign, vocalic l sign, and
vocalic ll sign, are only used to write Sanskrit words in the Malayalam script. The avagraha
is the most common of the four. The vocalic l sign is also commonly used in Sanskrit words.
Two specific forms of viramas are found in historical materials. The U+0D3B malayalam
sign vertical bar virama was used to indicate a pure consonant when transliterating
foreign words, while the U+0D3C malayalam sign circular virama was employed to
indicate a pure consonant in native Malayalam texts.
Suriyani Malayalam. The Suriyani dialect of Malayalam is written using the Syriac script.
It is also called Garshuni (Karshoni) or Syriac Malayalam. This usage requires eleven addi-
tional letters encoded in the Syriac Supplement block (U+0860..U+086F) to represent the
sounds of Malayalam. The dialect was widely used by the St. Thomas Christians living in
Kerala, India, in the 19th century.
Rendering Malayalam
Candrakkala. As is the case for many other Brahmi-derived scripts in the Unicode Stan-
dard, Malayalam uses a virama character to form consonant conjuncts. The virama sign
itself is known as candrakkala in Malayalam. Table 12-36 provides a variety of examples of
consonant conjuncts. There are both horizontal and vertical conjuncts, some of which
ligate, and some of which are merely juxtaposed.
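ക (U+0D15) + ് (U+0D4D) + ക (U+0D15) → ക്ക (kka)
ജ (U+0D1C) + ് (U+0D4D) + ഞ (U+0D1E) → ജ്ഞ (jña)
ട (U+0D1F) + ് (U+0D4D) + ട (U+0D1F) → ട്ട (ṭṭa)
പ (U+0D2A) + ് (U+0D4D) + പ (U+0D2A) → പ്പ (ppa)
ച (U+0D1A) + ് (U+0D4D) + ഛ (U+0D1B) → ച്ഛ (ccha)
ബ (U+0D2C) + ് (U+0D4D) + ബ (U+0D2C) → ബ്ബ (bba)
ന (U+0D28) + ് (U+0D4D) + യ (U+0D2F) → ന്യ (nya)
പ (U+0D2A) + ് (U+0D4D) + ര (U+0D30) → പ്ര (pra)
ശ (U+0D36) + ് (U+0D4D) + വ (U+0D35) → ശ്വ (śva)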
When the candrakkala sign is visibly shown in Malayalam, it indicates either the suppres-
sion of the preceding vowel or its replacement with a neutral vowel sound. This sound is
often called “half-u” or samvruthokaram. In traditional orthography it is displayed with a
vowel sign -u followed by candrakkala, and in modern orthography it is displayed with a
candrakkala alone. In all cases, the candrakkala sign is represented by the character
U+0D4D malayalam sign virama, which follows any vowel sign that may be present and
precedes any anusvara that may be present. Examples are shown in Table 12-37.
Explicit Candrakkala. The sequence <C1, virama, ZWNJ, C2>, where C1 and C2 are con-
sonants, may be used to request display with an explicit visible candrakkala, instead of the
default conjunct form. See Table 12-38 for an example. This convention is consistent with
the use of this sequence in other Indic scripts.
Requesting Traditional Ligatures. The sequence <C1, ZWJ, virama, C2> may be used to
request traditional ligatures, even if the current font defaults to the conjuncts appropriate
for the reformed orthography. When such sequences occur, a closed or cursively connected
ligature should be displayed, if available. See Table 12-38 for examples. This convention is
consistent with the use of this sequence in some other Indic scripts, such as Kannada,
Oriya, and Telugu.
Requesting Open Forms of Conjuncts. The sequence <C1, ZWNJ, virama, C2> may be
used to request open ligatures or those used in the reformed orthography, even if the cur-
rent font defaults to the conjuncts appropriate for the traditional orthography. When such
sequences occur, an open or disconnected conjunct form should be displayed, if available.
See Table 12-38 for examples. Note that such sequences are defined for Malayalam only,
and are left undefined for other Indic scripts.
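ക + ് + ര → ക്ര (kra), traditional or open form depending on the font
സ + ് + ക → സ്ക (ska), traditional or open form depending on the font
ത + ് + സ → ത്സ (tsa), traditional or open form depending on the font
ര + ് + വ → ര്വ (rva), one of three forms depending on the font
യ + ് + യ → യ്യ (yya)
ക + ് + ZWNJ + ര → (kra) with explicit candrakkala
ക + ZWJ + ് + ര → (kra) traditional ligature
സ + ZWJ + ് + ക → (ska) traditional ligature
ത + ZWJ + ് + സ → (tsa) traditional ligature
ര + ZWJ + ് + വ → (rva) traditional ligature
ക + ZWNJ + ് + ര → (kra) open form
ര + ZWNJ + ് + വ → (rva) open form
യ + ZWNJ + ് + യ → (yya) open form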
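The three joiner conventions described above differ only in which joiner is used and on which side of the virama it sits. A minimal Python sketch follows; the three function names are hypothetical, and the resulting display still depends on what the font provides.

```python
VIRAMA = "\u0D4D"  # U+0D4D MALAYALAM SIGN VIRAMA (candrakkala)
ZWJ = "\u200D"     # U+200D ZERO WIDTH JOINER
ZWNJ = "\u200C"    # U+200C ZERO WIDTH NON-JOINER

def explicit_candrakkala(c1, c2):
    """<C1, virama, ZWNJ, C2>: show a visible candrakkala on C1."""
    return c1 + VIRAMA + ZWNJ + c2

def traditional_ligature(c1, c2):
    """<C1, ZWJ, virama, C2>: request a closed or cursively connected ligature."""
    return c1 + ZWJ + VIRAMA + c2

def open_conjunct(c1, c2):
    """<C1, ZWNJ, virama, C2>: request an open or disconnected conjunct form."""
    return c1 + ZWNJ + VIRAMA + c2

KA, RA = "\u0D15", "\u0D30"
kra_explicit = explicit_candrakkala(KA, RA)
kra_traditional = traditional_ligature(KA, RA)
kra_open = open_conjunct(KA, RA)
```

As noted above, the last sequence is defined for Malayalam only.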
Anusvara. The anusvara can be seen multiple times after vowels, whether independent
letters or dependent vowel signs, as in ഈംംംം <0D08, 0D02, 0D02, 0D02, 0D02>. Vowel
signs can also be seen after digits, as in 355ാം <0033, 0035, 0035, 0D3E, 0D02>. More
generally, rendering engines should be prepared to handle Malayalam letters (including vowel
letters), digits (both European and Malayalam), dashes, U+00A0 no-break space and
U+25CC dotted circle as base characters for the Malayalam vowel signs, U+0D4D
malayalam sign virama, U+0D02 malayalam sign anusvara, and U+0D03 malayalam
sign visarga. They should also be prepared to handle multiple combining marks on
those bases.
Dot Reph. U+0D4E malayalam letter dot reph is used to represent the dead conso-
nant form of U+0D30 malayalam letter ra, when it is displayed as a dot or small vertical
stroke above the consonant that follows it in logical order. It has the character properties of
a letter rather than those of a combining mark, but special behavior is required in imple-
mentations. Conceptually, dot reph is analogous to the sequence <ra, virama> which, in
many Indic scripts, is rendered as a reph mark over the following consonant. This same
behavior is expected for dot reph: it should be rendered as a mark over the following con-
sonant. In standard Malayalam, the sequence <ra, virama> would normally occur only
within the sequence <ra, virama, ya>, which should be rendered as the nominal form of ra
with a conjoining form of ya.
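As a rough illustration of the logical-order relationship described above, the following Python sketch locates each dot reph and the consonant it would be drawn over. The function names are hypothetical, and the consonant test is an assumption that treats U+0D15..U+0D3A as the Malayalam consonant range; a real shaping engine would rely on font and script-specific data instead.

```python
DOT_REPH = "\u0D4E"  # MALAYALAM LETTER DOT REPH


def is_consonant(ch):
    # Assumption: Malayalam consonant letters occupy U+0D15..U+0D3A.
    return 0x0D15 <= ord(ch) <= 0x0D3A


def dot_reph_attachments(text):
    """Return (reph_index, consonant_index) pairs.

    The dot reph precedes its consonant in logical order but is
    rendered as a mark above that consonant.
    """
    pairs = []
    for i, ch in enumerate(text[:-1]):
        if ch == DOT_REPH and is_consonant(text[i + 1]):
            pairs.append((i, i + 1))
    return pairs
```

A dot reph with no following consonant yields no pair here; how such a stray character is displayed is left to the implementation.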
The sequence <ra, virama, ZWJ> is not used to represent the dot reph, because that
sequence has considerable preexisting usage to represent the chillu form of ra, prior to the
encoding of the chillu form as a distinct character, U+0D7C MALAYALAM LETTER CHILLU
RR.
The Malayalam dot reph was in common print usage until 1970, but has fallen into disuse.
Words that formerly used dot reph on a consonant are now spelled instead with a chillu-rr
form preceding the consonant. (See the following discussion of chillu characters.) The dot
reph form is predominantly used by those who completed elementary education in Malay-
alam prior to 1970.
Chillu Forms. The six characters, U+0D7A..U+0D7F, encode dead consonants (those
without an inherent vowel) known as chillu or cillakṣaram. In Malayalam language text,
chillu forms never start a word. Occasionally, chillu forms may take vowels or be elements
of conjuncts. The forms chillu-nn, -n, -rr, -l, and -ll are quite common; chillu-k is relatively
rare in contemporary usage.
For backward-compatibility issues regarding the representation of chillu forms, see the dis-
cussion of legacy chillu sequences later in this section.
Special Cases Involving rra. There are a number of textual representation and reading
issues involving the letter rra. These issues are discussed here and tables of explicit exam-
ples are presented.
The letter റ rra is normally read /ṟa/. Repetition of that sound is naturally written by
repeating the letter: ററ. Each occurrence can bear a vowel sign.
The same repetition of the letter rra as ററ is also used for /ṯṯa/, which can be unambiguously
represented by റ്റ. The sequence of two റ letters fundamentally behaves as a digraph
in this instance. The digraph can bear a vowel sign, in which case the digraph as a whole acts
graphically as an atom: a left vowel part goes to the left of the digraph and a right vowel part
goes to the right of the digraph. Historically, the side-by-side form was used until around
1960 when the stacked form began appearing and supplanted the side-by-side form.
As a consequence, the graphical sequence ററ in text is ambiguous in reading. The reader
must generally use the context to understand whether ററ is read /ṟaṟa/ or /ṯṯa/. It is only when a
vowel part appears between the two റ that the reading cannot be /ṯṯa/. Note that similar
situations are common in many other orthographies. For example, th in English can be a
digraph (cathode) or two separate letters (cathouse); gn in French can be a digraph
(oignon) or two separate letters (gnome).
The sequence <0D31, 0D31> is rendered as ററ, regardless of the reading of that text. The
sequence <0D31, 0D4D, 0D31> is rendered as റ്റ. In both cases, vowel signs can be used as
appropriate, as shown in Table 12-39.
A very similar situation exists for the combination of ൻ chillu-n and റ rra. When used side
by side, ൻറ can be read either /ṉṟa/ or /ṉṯa/, while stacked ൻ്റ is always read /ṉṯa/.
The sequence <0D7B, 0D31> is rendered as ൻറ, regardless of the reading of that text. The
sequence <0D7B, 0D4D, 0D31> is rendered as ൻ്റ. In both cases, vowel signs can be used
as appropriate, as shown in Table 12-40.
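The four code point sequences given above for rra can be told apart mechanically, even though the reading of the side-by-side forms cannot. The following Python sketch, with a hypothetical function name and label strings, reports which layout each sequence requests; it makes no attempt to resolve the reading.

```python
import re

RRA = "\u0D31"       # MALAYALAM LETTER RRA
CHILLU_N = "\u0D7B"  # MALAYALAM LETTER CHILLU N
VIRAMA = "\u0D4D"    # MALAYALAM SIGN VIRAMA

# rra or chillu-n, an optional virama, then rra
_PATTERN = re.compile("([" + RRA + CHILLU_N + "])(" + VIRAMA + "?)" + RRA)


def rra_forms(text):
    """Return (index, first_letter, layout) for each matching sequence.

    layout is "stacked" when a virama intervenes, otherwise
    "side-by-side" (whose reading depends on context).
    """
    results = []
    for m in _PATTERN.finditer(text):
        first = "rra" if m.group(1) == RRA else "chillu-n"
        layout = "stacked" if m.group(2) else "side-by-side"
        results.append((m.start(), first, layout))
    return results
```

Matches do not overlap, so a run of three rra letters is reported as one side-by-side pair followed by an unmatched letter.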
Legacy Chillu Sequences. Prior to Unicode Version 5.1, the representation of text with
chillu forms was problematic and was not clearly described in the text of the standard. Because
older data uses a different representation for chillu forms, implementations must be
prepared to handle both kinds of data. For chillu forms considered in isolation, the following
table shows the relationship between their representation in Version 5.0 and earlier, and
the recommended representation starting with Version 5.1. Note that only the first five
chillu forms listed in Table 12-41 were represented in legacy text by <virama, ZWJ>
sequences. The other chillu forms are only represented as atomically encoded chillu char-
acters.
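One way an implementation might accept both kinds of data is to map legacy sequences to the atomic characters before further processing. The Python sketch below is illustrative only: the function name is hypothetical, and the mapping is an assumption that each of the first five chillu forms corresponds to <base consonant, virama, ZWJ>, so it should be checked against Table 12-41. It applies to chillu forms considered in isolation and does not examine the surrounding context.

```python
VIRAMA = "\u0D4D"  # MALAYALAM SIGN VIRAMA
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# Assumed correspondence (verify against Table 12-41):
# base consonant of the legacy sequence -> atomic chillu character
LEGACY_BASE_TO_CHILLU = {
    "\u0D23": "\u0D7A",  # nna -> chillu nn
    "\u0D28": "\u0D7B",  # na  -> chillu n
    "\u0D30": "\u0D7C",  # ra  -> chillu rr
    "\u0D32": "\u0D7D",  # la  -> chillu l
    "\u0D33": "\u0D7E",  # lla -> chillu ll
}


def modernize_chillus(text):
    """Replace <consonant, virama, ZWJ> with the atomic chillu character."""
    for base, chillu in LEGACY_BASE_TO_CHILLU.items():
        text = text.replace(base + VIRAMA + ZWJ, chillu)
    return text
```

Chillu forms with no legacy sequence, such as chillu-k, pass through unchanged, as does text that already uses the atomic characters.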