Myanmar Uni-V2
Introduction
The first edition of this technical note addressed the issue of how Myanmar text was encoded using the
Unicode standard as it stood until version 5.1. With Unicode 5.1 various new characters were added to the
Myanmar block which had the effect of simplifying the encoding model considerably. Such a change could
only come about with agreement from all implementors and those with existing data because they will need
to update and change to the new model. This is nearly impossible to achieve if existing implementations are
already in widespread use, which was not the case at the time for the Myanmar block. In addition, such a
change was necessary to facilitate the encoding of minority scripts. So with a necessity and a unique
opportunity for change, the characters were added and the encoding model simplified.
This technical note describes the simplified model and keeps the older model description as a later section
for comparison. The information is structured to follow closely the previous edition of this technical note.
The author wishes to thank the Myanmar Language Commission, the Myanmar NLP Lab and the Myanmar
Computer Federation for reviewing and providing input to this version of the document.
1 SIL International and Payap University, Chiang Mai, THAILAND
Representing Myanmar in Unicode Page 1 of 15
Unicode 5.1 Model
Basic Myanmar
The basic consonants and vowels are relatively obvious in how they are encoded. Thus:
Syllable Chaining
In the case of syllable chaining, subjoined characters are not given their own codes. Instead a virama
character is used to indicate that the following character is subjoined and should take a subjoined form.
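As a minimal sketch of this convention, a stacked pair can be built by placing U+1039 between the upper consonant and the subjoined one. The helper names `subjoin` and `codepoints` are illustrative only, and the choice of consonants is a hypothetical example rather than a word taken from this note.

```python
VIRAMA = "\u1039"  # MYANMAR SIGN VIRAMA: marks the next consonant as subjoined


def subjoin(upper: str, lower: str) -> str:
    """Return `upper` + virama + `lower`, i.e. `lower` stacked beneath `upper`."""
    return upper + VIRAMA + lower


def codepoints(text: str) -> str:
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


# Hypothetical example: U+1010 TA with a second U+1010 TA stacked beneath it.
stacked = subjoin("\u1010", "\u1010")
print(codepoints(stacked))
```

No separate code point is stored for the subjoined form; the rendering engine selects it because of the preceding virama.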
Devoweliser
There are two ways of representing the devowelising process. The first is by creating a medial or syllable
chained form, using U+1039 to mark the devowelising (as shown above). The second is to use the visible
virama character ◌် (U+103A MYANMAR SIGN ASAT) in conjunction with a base consonant.
Notice the general order of: initial consonant cluster, vowels, tones.
For example:
Advanced Issues
So far we have covered what is explained in the Unicode Standard4. In this section we examine some of the
more difficult areas of the Myanmar language including some implementation details regarding line
breaking and sorting; further examination of the kinzi question; contractions and some issues with respect to
Old Myanmar.
Line breaking
Myanmar does not have interword spaces like English. Instead spaces are used to mark phrases. Some
phrases are relatively short (two or three syllables, 1.5em, or 2.3 times the width of U+1000 က) while others
can be quite long (8.5em or 13 times the width of U+1000 က). A common approach to addressing line
breaking issues is to adjust the phrase spacing so that a line breaks at a phrase break. If this approach fails
and a phrase must be continued onto a second line, U+200B ZERO WIDTH SPACE may be used to indicate a
possible line break point in the text.
2 Notice the extension of the list here to include independent vowels. The Unicode Standard V4.0 only lists values up to
U+102A. U+104E has changed glyph and can function in consonant position, as in ၎င်း (104E 1004 103A 1038).
3 Only for use with Kinzi and contractions
4 Version 5.1
The problem with this approach is that when phrases are quite long or a lot of text is to be typeset, the
manual adjustment of phrasing or the introduction of zero width spaces can be onerous. A further option is to
break lines automatically within phrases when needed. The clearest solution is to have a line break occurring
at a word boundary, but since there are no word breaks in Myanmar this is not immediately possible. Most
words, though, are mono-syllabic and so a mechanism of breaking lines at syllable boundaries is usually
sufficient. From this we can say that a syllable break may occur before a Myanmar digit, an independent
vowel, one of the various signs or a base consonant so long as the consonant:
• is not devowelised with an asat and
• has no stacked consonant below it and
• is not a kinzi.
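The rules above can be sketched in code. This is a minimal sketch under stated assumptions, not a definitive implementation. The function names are hypothetical. The character ranges (consonants U+1000 to U+1021, independent vowels U+1023 to U+102A, digits U+1040 to U+1049, various signs U+104C to U+104F) are assumptions read from the Unicode code chart. The check that a consonant preceded by U+1039 is itself subjoined, and so never starts a line, is also an assumption. Kinzi needs no separate test here because its nga is followed by U+103A.

```python
ASAT = "\u103A"    # visible devowelising mark
VIRAMA = "\u1039"  # invisible stacking mark


def from_hex(codes: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(c, 16)) for c in codes.split())


def is_consonant(ch: str) -> bool:
    return "\u1000" <= ch <= "\u1021"          # assumed consonant range


def always_starts_syllable(ch: str) -> bool:
    return ("\u1023" <= ch <= "\u102A"         # independent vowels (assumed)
            or "\u1040" <= ch <= "\u1049"      # digits
            or "\u104C" <= ch <= "\u104F")     # various signs (assumed)


def syllable_breaks(text: str) -> list[int]:
    """Return the indices before which a line break is permitted."""
    breaks = []
    for i in range(1, len(text)):
        ch = text[i]
        if always_starts_syllable(ch):
            breaks.append(i)
        elif is_consonant(ch):
            following = text[i + 1] if i + 1 < len(text) else ""
            if following == ASAT or following == VIRAMA:
                continue  # devowelised, kinzi, or carries a stacked consonant
            if text[i - 1] == VIRAMA:
                continue  # this consonant is the subjoined one
            breaks.append(i)
    return breaks


def split_syllables(text: str) -> list[str]:
    cuts = [0] + syllable_breaks(text) + [len(text)]
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


sample = from_hex("1021 102D 1015 103A 1001 1014 103A 1038 "
                  "1010 1036 1001 102B 1038 1000 102D 102F")
print(" | ".join(split_syllables(sample)))
```

The sketch looks only one character ahead, so a final consonant whose asat is separated from it by another mark (for example U+1037) would need an extra look-ahead step.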
These same syllable breaking rules are used for sorting purposes, with the addition of non-line breaking
syllable breaks, such as those occurring between the two characters in a syllable chain. For example these
phrases show possible inter-syllable line breaks.
ကောင်လေးတွေကျောင်းကိုသွားကြတယ်။ (the kids are going to school)
1000 1031 102C 1004 103A | 101C 1031 1038 | 1010 103D 1031 | 1000 103B 1031 102C 1004 103A 1038 | 1000 102D 102F | 101E 103D 102C 1038 | 1000 103C | 1010 101A 103A 104B
အိပ်ခန်းတံခါးကို (to the bedroom door)
1021 102D 1015 103A | 1001 1014 103A 1038 | 1010 1036 | 1001 102B 1038 | 1000 102D 102F
Notice how in the second example the word 1010 1036 | 1001 102B 1038 is a single word with multiple
syllables. Is there some way, without a dictionary, that we can ensure that the word is not line broken? There
is a Unicode character for this purpose: U+2060 WORD JOINER, added in Unicode 3.2. Before this, the character
U+FEFF ZERO WIDTH NON-BREAKING SPACE was used. Since U+FEFF is most commonly used at the start of a
Unicode text file both to identify it as Unicode data and to indicate the encoding form of the data,
U+2060 was added to the standard to take over the function of zero width non-breaking space. The role of
this character is to indicate a non-breaking point in a text. Lines should not be broken at that point.
Therefore, if we want to ensure that no line break occurs at the syllable boundary within our polysyllabic
word, we can insert a U+2060 into our data stream between the two syllables and a rendering engine should
not break a line at that point. Thus:
အိပ်ခန်းတံခါးကို (to the bedroom door)
1021 102D 1015 103A | 1001 1014 103A 1038 | 1010 1036 2060 1001 102B 1038 | 1000 102D 102F
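Joining the syllables of a polysyllabic word with U+2060 can be sketched as follows. The function names `join_word` and `filter_breaks` are hypothetical, and the filter assumes that candidate break positions have already been found by some syllable-level breaker.

```python
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER: no line break at this point


def from_hex(codes: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(c, 16)) for c in codes.split())


def join_word(syllables: list[str]) -> str:
    """Join the syllables of one word so that no line break falls inside it."""
    return WORD_JOINER.join(syllables)


def filter_breaks(text: str, candidates: list[int]) -> list[int]:
    """Drop candidate break positions that sit next to a WORD JOINER."""
    return [i for i in candidates
            if 0 < i < len(text)
            and WORD_JOINER not in (text[i - 1], text[i])]


# The two syllables of the word for "door" from the example above.
door = join_word([from_hex("1010 1036"), from_hex("1001 102B 1038")])
print(" ".join(f"{ord(ch):04X}" for ch in door))

# A syllable breaker would propose a break before U+1001 (index 3);
# the joiner in front of it suppresses that break.
print(filter_breaks(door, [3]))
```

Keeping the joiner in the stored text, rather than in a separate dictionary, means any rendering engine that honours U+2060 gets the word-level behaviour without extra data.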
In summary, therefore, we propose three levels of line breaking support: breaking at phrase spaces; breaking
at syllable breaks and support for polysyllabic words. A rendering engine may choose the sophistication of
line breaking support it provides.
Sorting
Sorting Myanmar strings is a complex process involving significant string transformation and four levels of
comparison. The string transformation is a syllable-based operation for which the identification of syllable
boundaries (but not word boundaries) is required. The same techniques that are used for line breaking
may, therefore, be used for sorting.
Kinzi revisited
One of the significant improvements brought about by the addition of the asat character is that kinzi is now
unambiguously encoded. Thus:
Contractions
The Myanmar language has a system of double acting consonants, where a consonant acts as both the final
of a syllable and the initial of a following syllable. These are significant for sorting purposes. Double acting
consonants are rare, but occur in two common words.
Old Myanmar
Storing Old Myanmar text introduces a few issues, although again, most of these are resolved
by the simplified encoding model.
Stacked Ya
There are occasions where a medial ya (U+103B) representation is used for a stacking ya. What is needed is a
syllable break between the base consonant and the ya. Thus we propose:
Introduction
This section contains most of the text of the original edition of UTN#11 and was written in conjunction with
Maung Tuntunlwin5.
Basic Myanmar
The basic consonants and vowels are relatively obvious in how they are encoded. Thus:
5 Myanmar World Distribution
In Unicode this devowelising process is marked using the virama code (U+1039 MYANMAR SIGN VIRAMA).
Thus we store a consonant followed by the virama and then follow it with the consonant of interest.
Devoweliser
There are two ways of representing the devowelising process. The first is by creating a medial or syllable
chained form, using U+1039 to mark the devowelising. The second is to use the visible virama character (◌်)
in conjunction with a base consonant. But if U+1039 is being used to mark medials and syllable chaining,
how is the visible character to be represented? The Unicode standard gives the answer. The sequence U+1039
MYANMAR SIGN VIRAMA followed by U+200C ZERO WIDTH NON-JOINER is used to represent a visual virama
(◌်).6
Kinzi
The remaining issue regarding representation needed for the modern Myanmar language is how kinzi is
represented in Unicode. Glyph based encodings give the kinzi its own code. But linguistically, the kinzi is
merely a special form of a devowelised nga (U+1004 MYANMAR LETTER NGA). Thus we encode kinzi as
U+1004 U+1039.
6 For fallback purposes, U+1039 also displays a visible virama (◌်) if not followed by a consonant. This is an implementation detail and is not
used in spelling words, i.e. all such occurrences should be considered wrong spellings.
7 Notice the addition to the list of independent vowels. The Unicode Standard v4.0 only lists values up to U+1021.
Notice the general order of: initial consonant cluster, vowels, tones.
For example:
1015 101E 1039 101A 1039 101F 1030 1038 (Malay)
1019 1039 101B 1039 101D 1039 101F 102C (segmentalize)
101E 1039 101A 1039 101F 1031 102C 1004 1039 200C (top knot)
Advanced Issues
So far we have covered what is explained in the Unicode Standard8. In this section we examine some of the
more difficult areas of the Myanmar language including some implementation details regarding line
breaking and sorting; further examination of the kinzi question; contractions and some issues with respect to
Old Myanmar.
Line breaking
Myanmar does not have interword spaces like English. Instead spaces are used to mark phrases. Some
phrases are relatively short (two or three syllables, 1.5em, or 2.3 times the width of U+1000 က) while others
can be quite long (8.5em or 13 times the width of U+1000 က). A common approach to addressing line
breaking issues is to adjust the phrase spacing so that a line breaks at a phrase break. If this approach fails
and a phrase must be continued onto a second line, U+200B ZERO WIDTH SPACE may be used to indicate a
possible line break point in the text.
The problem with this approach is that when phrases are quite long or a lot of text is to be typeset, the
manual adjustment of phrasing or the introduction of zero width spaces can be onerous. A further option is to
break lines automatically within phrases when needed. The clearest solution is to have a line break occurring
at a word boundary, but since there are no word breaks in Myanmar this is not immediately possible. Most
words, though, are mono-syllabic and so a mechanism of breaking lines at syllable boundaries is usually
sufficient. From this we can say that a syllable break may occur before a base consonant so long as the
consonant:
• is not devowelised with a visible virama and
• has no stacked consonant below it (ignoring true medials: –y –r –w –h) and
• is not a kinzi.
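Under this older model the same rules need a longer look-ahead, because U+1039 marks medials, stacks, kinzi and (with U+200C) the visible virama alike. The following is a minimal sketch with hypothetical function names. It assumes consonants lie in U+1000 to U+1021, and it treats U+1039 followed by one of the four medial consonants as a true medial, which is the interpretation discussed under "Kinzi revisited" and not the only possible one.

```python
VIRAMA = "\u1039"
MEDIALS = frozenset("\u101A\u101B\u101D\u101F")  # ya, ra, wa, ha


def from_hex(codes: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(c, 16)) for c in codes.split())


def is_consonant(ch: str) -> bool:
    return "\u1000" <= ch <= "\u1021"  # assumed consonant range


def syllable_breaks_old_model(text: str) -> list[int]:
    """Return the indices before which a line break is permitted."""
    breaks = []
    for i in range(1, len(text)):
        if not is_consonant(text[i]):
            continue
        if text[i - 1] == VIRAMA:
            continue  # this consonant is itself a medial or subjoined
        # Virama followed by anything other than a medial consonant means a
        # visible virama (U+200C), a stack, a kinzi, or a fallback virama.
        if text[i + 1:i + 2] == VIRAMA and text[i + 2:i + 3] not in MEDIALS:
            continue
        breaks.append(i)
    return breaks


sample = from_hex("1021 102D 1015 1039 200C 1001 1014 1039 200C 1038 "
                  "1010 1036 1001 102C 1038 1000 102F 102D")
print(syllable_breaks_old_model(sample))
```

Compared with the simplified model, every decision here depends on what follows the virama, which is the complexity the later additions to the block removed.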
These same syllable breaking rules are used for sorting purposes, with the addition of non-line breaking
syllable breaks, such as those occurring between the two characters in a syllable chain. For example these
phrases show possible inter-syllable line breaks.
the kids are going to school
1000 1031 102C 1004 1039 200C | 101C 1031 1038 | 1010 1039 101D 1031 | 1000 1039 101A 1031 102C 1004 1039 200C 1038 | 101E 102F 102D 1037 | 101E 1039 101D 102C 1038 | 1000 1039 101B | 101E 100A 1039 200C 104B
to the bedroom door
1021 102D 1015 1039 200C | 1001 1014 1039 200C 1038 | 1010 1036 | 1001 102C 1038 | 1000 102F 102D
Notice how in the second example the word 1010 1036 | 1001 102C 1038 is a single word with multiple
syllables. Is there some way, without a dictionary, that we can ensure that the word is not line broken?
There is a Unicode character for this purpose: U+2060 WORD JOINER, added in Unicode 3.2. Before this, the
character U+FEFF ZERO WIDTH NON-BREAKING SPACE was used. Since U+FEFF is most commonly used at the
start of a Unicode text file both to identify it as Unicode data and to indicate the encoding form of the
data, U+2060 was added to the standard to take over the function of zero width non-breaking space. The role
of this character is to indicate a non-breaking point in a text. Lines should not be broken at that point.
8 Version 4.0, 2003
Therefore, if we want to ensure that no line-break occurs at the syllable boundary within our poly-syllabic
word, we can insert a U+2060 into our data stream between the two syllables and a rendering engine should
not break a line at that point. Thus:
to the bedroom door
1021 102D 1015 1039 200C | 1001 1014 1039 200C 1038 | 1010 1036 2060 1001 102C 1038 | 1000 102F 102D
In summary, therefore, we propose three levels of line breaking support: breaking at phrase spaces; breaking
at syllable breaks and support for polysyllabic words. A rendering engine may choose the sophistication of
line breaking support it provides.
Sorting
Sorting Myanmar strings is a complex process involving significant string transformation and four levels of
comparison. The string transformation is a syllable-based operation for which the identification of syllable
boundaries (but not word boundaries) is required. The same techniques that are used for line breaking
may, therefore, be used for sorting.
Kinzi revisited
Consider the word အငွေ. How can it be represented? The normal encoding we would expect would be 1021
1004 1039 101D 1031. But there are two ways of interpreting this string:
The question is how different systems will interpret the string. One approach is to say that kinzi above a
consonant is rare and that kinzi above one of the four medial consonants (U+101A, U+101B, U+101D, U+101F) is
very rare. We can then interpret the sequence U+1004 U+1039 U+10xx as (U+1004
U+1039) U+10xx if U+10xx is a normal non-medial consonant, and as U+1004
(U+1039 U+10xx) if U+10xx is one of the medial consonants.
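This interpretation rule can be sketched as a small classifier. The function name `interpret_nga_virama` and its return labels are hypothetical, and the consonant range U+1000 to U+1021 is an assumption.

```python
NGA = "\u1004"
VIRAMA = "\u1039"
MEDIALS = frozenset("\u101A\u101B\u101D\u101F")  # ya, ra, wa, ha


def interpret_nga_virama(text: str, i: int):
    """Classify the sequence U+1004 U+1039 U+10xx starting at index i.

    Returns "kinzi" for (U+1004 U+1039) U+10xx, "medial" for
    U+1004 (U+1039 U+10xx), or None if the sequence does not match.
    """
    if text[i:i + 2] != NGA + VIRAMA:
        return None
    follower = text[i + 2:i + 3]
    if not ("\u1000" <= follower <= "\u1021"):  # assumed consonant range
        return None
    return "medial" if follower in MEDIALS else "kinzi"


# 1021 1004 1039 101D 1031: the follower is U+101D, a medial consonant.
print(interpret_nga_virama("\u1021\u1004\u1039\u101D\u1031", 1))
# 1004 1039 1002: the follower is an ordinary consonant.
print(interpret_nga_virama("\u1004\u1039\u1002", 0))
```

The rule is a heuristic based on frequency, so it cannot express a kinzi that really does sit above a medial consonant.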
But neither of these representations is the word we want, so how can we represent this word? The main
issue is whether we interpret 1021 1004 1039 101D 1031 as 1021 (1004 1039) 101D 1031 or as 1021
1004 (1039 101D) 1031.
There are different approaches we can take. The approach we propose here is to introduce either a U+200C
ZERO WIDTH NON-JOINER or a U+200D ZERO WIDTH JOINER between the U+1004 and the following U+1039.
The effect of this is to break the kinzi string and to make the U+101D look more like a medial than a main
consonant. The sub-sequence 1004 200C 1039 101D results in the rendering we want. But it also marks that
there is a syllable break between the U+1004 and the U+101D, while we want the syllable break to occur
before the U+1004. So a better solution is: 1004 200D 1039 101D. Here the ZERO WIDTH JOINER indicates
that the syllable should be held together and therefore that there is a break before the syllable. Notice that all
this talk of syllable breaks makes no difference for rendering (even line breaking). It is only needed for
sorting purposes.
If we consider all the sequences that can occur, we can see what the rendering will be and where the syllable
break will occur when sorting.
Old Myanmar
Storing Old Myanmar text introduces a few issues.
Stacked Ya
There are occasions where a medial ya (U+101A) representation is used for a stacking ya. What is needed is a
syllable break between the base consonant and the ya. Thus we propose:
U+200C ZERO WIDTH NON-JOINER and U+200D ZERO WIDTH JOINER are more problematic. At the strictest
level of comparison they should be included, but since in many cases they are only used to control syllable
breaking for sorting, a helpful approach to searching and comparison would consider these characters only
when they affect rendering. U+200C affects rendering only when it follows U+1039 MYANMAR SIGN VIRAMA.
U+200D affects rendering only when it occurs in the sequence U+1004 U+200D U+1039.
Conclusion
With the change to the Myanmar encoding model comes a much greater simplicity while not changing the
original character of the model which is both linguistic and practical. The model may come as a surprise to
those who are used to a glyph based encoding in which each glyph shape and position receives its own code,
or more radically each cluster receives its own code.