A Study of Myanmar Word Segmentation Schemes For Statistical Machine Translation
4. SMT Experiments
Table 1: Number of tokens (Train, Development, Test) and average syllables per token for each segmentation scheme

Segmentation Method                      Train       Development   Test      Average Syllable per Token
Human Translator                           151,829       24,273      2,267   5.45
Character Breaking                       2,301,184      339,545     33,449   0.36
Syllable Breaking                          835,030      123,961     12,654   1.00
Syllable + Maximum Matching                718,874      103,447     10,206   1.17
Unsupervised (3 gram)                      565,304       81,536      8,299   1.48
Unsupervised (4 gram)                      577,159       83,855    108,893   1.26
Unsupervised (5 gram)                      575,428       84,530      8,564   1.45
Unsupervised (6 gram)                      567,322       83,328      8,431   1.47
Unsupervised (7 gram)                      573,244       84,965      8,511   1.46
Syl, Max Match, Unsupervised (3 gram)      526,203       75,082      7,495   1.60
Syl, Max Match, Unsupervised (4 gram)      527,216       75,536      7,464   1.59
Syl, Max Match, Unsupervised (5 gram)      526,794       76,010      7,483   1.59
Syl, Max Match, Unsupervised (6 gram)      526,742       75,814      7,568   1.59
Syl, Max Match, Unsupervised (7 gram)      526,803       75,982      7,595   1.59
Semi-Supervised (100 sentences)            527,052       79,955      7,943   1.58
Semi-Supervised (200 sentences)            541,722       81,210      8,041   1.54
Semi-Supervised (300 sentences)            551,389       81,457      8,017   1.52
Semi-Supervised (400 sentences)            546,530       80,352      7,964   1.53
Semi-Supervised (500 sentences)            560,899       82,054      8,114   1.49
Semi-Supervised (600 sentences)            568,567       83,238      8,200   1.47
Semi-Supervised (700 sentences)            554,313       80,406      8,054   1.51
Semi-Supervised (800 sentences)            551,787       80,713      7,992   1.52
Semi-Supervised (900 sentences)            550,423       79,865      7,924   1.52
Semi-Supervised (1000 sentences)           551,327       80,208      7,953   1.52
Semi-Supervised (1100 sentences)           509,566       80,416      7,996   1.62
Semi-Supervised (1200 sentences)           515,162       80,534      8,118   1.61
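The final column of Table 1 above is simply the total syllable count divided by the total token count under each scheme; under "Syllable Breaking" the ratio is 1.00 by construction. A minimal sketch of this statistic, assuming a whitespace-tokenised corpus and a hypothetical `count_syllables` helper standing in for the rule-based Myanmar syllable breaker:

```python
def average_syllables_per_token(sentences, count_syllables):
    """Average number of syllables per token over a segmented corpus.

    `sentences` is an iterable of whitespace-segmented strings;
    `count_syllables` is a (hypothetical) function mapping a token to
    its syllable count, e.g. a rule-based Myanmar syllable breaker.
    """
    total_tokens = 0
    total_syllables = 0
    for sentence in sentences:
        for token in sentence.split():
            total_tokens += 1
            total_syllables += count_syllables(token)
    return total_syllables / total_tokens

# Toy check: if every token is exactly one syllable, the average is 1.0,
# as for the "Syllable Breaking" row above.
print(average_syllables_per_token(["a b c", "d e"], lambda tok: 1))  # prints 1.0
```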
Table 1 shows the number of tokens and the average syllables per token resulting from each of the word segmentation schemes. Here, 3-gram, 4-gram, 5-gram, 6-gram and 7-gram specify the n-gram order of the language model and spelling model used in the unsupervised model.

4.2. Word Segmentation Methods

In the SMT experiments from Myanmar to other languages, we compare the following segmentation methods:

Translation with human translators' segmentation: The BTEC corpus for Myanmar contains some word segmentation added by human translators during translation. These word boundaries were added naturally by the annotators while creating the corpus, and due to the nature of the language they are quite sparse.

Translation with character breaking: Each Myanmar character is interpreted as a single word.

Syllable Breaking: Each Myanmar syllable is interpreted as a single word.

Syllable Breaking + Maximum Matching: First, syllable breaking was done; then Maximum Matching word segmentation was done.

Unsupervised Word Segmentation: First, syllable breaking was done; then the syllable-segmented corpus was segmented using latticelm (with 3-gram to 7-gram language models depending on the experiment).

Syllable Breaking, Maximum Matching and Unsupervised Word Segmentation: First, syllable breaking was done. Second, Maximum Matching word segmentation was done on the syllable-segmented corpus. Finally, the resulting corpus was segmented using latticelm (with 3-gram to 7-gram language models depending on the experiment).

Supervised Word Segmentation: First, manual segmentation was done; then the syllable-segmented corpus was segmented with KyTea. Twelve experiments were performed in total using different amounts of manually segmented data to train KyTea (ranging from 100 to 1,200 sentences).

We calculated the F-score [24] for each segmentation method based on 1,000 manually segmented sentences of Myanmar (see Table 3). We used the Edit Distance of the Word Separator (EDWS) and defined the segmentation precision, recall and harmonic mean F as follows:

Precision = (no. of Sub) / (no. of separators in Hyp)
Recall = (no. of Sub) / (no. of separators in Ref)
F = 2 * Precision * Recall / (Precision + Recall)

Here, Sub = substitutions, Hyp = hypothesis and Ref = reference.

Clearly, "Character Breaking" gives the maximum number of words: 32,748 words, and "Syllable Breaking" gives the second highest number of words: 12,545 words. Syllable breaking followed by the Maximum Matching method gives 10,202 words, and "Human Translator" gives the lowest number of words: 1,985 words. "Syllable + Maximum Matching" gives the highest F1 score. The lowest F1 score of 0.23 was given by the Human Translator's segmentation. This is because human translators rarely put spaces between words, especially in short sentences. 100% recall can be achieved by Syllable Breaking and Character Breaking. The supervised method was trained on the test data for this experiment, and thus we do not show the F1 results for the supervised methods in Table 3.

The same Myanmar sentence will be segmented in radically different ways depending on the segmentation method. Figure 3 shows some of the different segmentations of a Myanmar sentence sampled from the development data. Figure 4 shows an example of word alignment for a Myanmar (Syllable Breaking) and English (word breaking) sentence pair.

4.3. Phrase-based Statistical Machine Translation

The Myanmar source segmented by each of the segmentation methods described in Section 4.2 is aligned to the word-segmented target languages (Japanese, Korean, Hindi (Romanized), Hindi (Devanagari), English, Thai, Chinese and Arabic) using GIZA++ [25]. Language modeling is done using IRSTLM version 5.80.01 [26]. Minimum error rate training (MERT) was used to tune the decoder's parameters, and decoding is done using the phrase-based SMT system MOSES version 0.91 [27].

4.4. Evaluation Criteria

We used two automatic criteria for the evaluation of the SMT. One is the de facto standard automatic
evaluation metric Bilingual Evaluation Understudy
(BLEU) [28] and the other is the Rank-based Intuitive Bilingual Evaluation Measure (RIBES) [29]. The BLEU score measures the precision of 1-grams to 4-grams with respect to a reference translation, with a penalty for short sentences [28]. The BLEU score approximately measures the adequacy of SMT, and larger BLEU scores are better. RIBES is an automatic evaluation metric based on rank correlation coefficients modified with precision, and special care is paid to the word order of the translation results. RIBES is suitable for distant language pairs such as Myanmar and English [29]. Larger RIBES scores are better. We calculated the Pearson product-moment correlation coefficient (PMCC) between BLEU and F1, and between RIBES and F1, to assess the strength of the linear relationship between the segmentation schemes and the quality of SMT.

5. Results

5.1. Discussion

We divided the experiments into three groups (rule-based, unsupervised and supervised) and divided the target languages into three groups (SOV, SVO and VSO) for the comparison. We highlighted the table cells with the maximum BLEU and RIBES scores for each target language in Tables 4 to 9.

The results show that "Syllable Breaking" segmentation consistently gives the best BLEU and RIBES scores for all language pairs. The reason might be that with syllable segmentation very few errors are made. As mentioned in Sections 2.2 and 3.1, the syllables themselves can be delimited with close to 100% accuracy, and it is in principle possible to group these syllables to form the words of Myanmar without error. Increasing the granularity of the segmentation above this level can introduce errors in which the sequences of syllables do not constitute a word. For example, erroneously segmented "words" may contain syllables from more than one true word in the language.

The "Syllable + Maximum Matching" segmentation method also consistently gives rise to high BLEU and RIBES scores. As we mentioned in Section 3.2, Maximum Matching uses a dictionary for left-to-right word segmentation over segmented syllables. Although this segmentation method can make incorrect decisions during segmentation, we believe that its error rate is low relative to the data-driven methods. This is supported by the precision and recall figures shown in Table 3.

When we analyze the "Unsupervised" segmentation method, the results were quite inconsistent. Generally, the highest BLEU scores occur in the 4-gram to 6-gram range and the highest RIBES scores range from 3-gram to 7-gram. A possible explanation is that the segmentation quality has a high degree of variance, and depends on the training parameters or initial conditions. Nonetheless, the second highest BLEU score (my-ja, 33.45) was achieved by the "Unsupervised (6 gram)" segmentation model and the second highest RIBES score (my-ja, 0.819) is given by the "Unsupervised (4 gram)" segmentation model. These are encouraging results, and we believe that this method may have the potential to achieve respectable levels of performance given sufficient data.

When we analyze the SMT quality of the supervised approach, as might be expected, we see a strong dependence on the quantity of data used to train the segmentation model. There is some variance in the results, but the better results in terms of both BLEU and RIBES scores occur in the 500-1200 sentence range. The manually segmented corpus was quite small, and although this approach was unable to match the best performing methods, it came reasonably close using all the data, and was still improving at this point. We therefore expect that pursuing supervised segmentation could lead to a viable method of segmentation for low-resource languages if more manually segmented data were available.

The lowest BLEU and RIBES scores were given by the "Human Translator" segmentation. The reason for this is that most of the translated sentences of the BTEC corpus have no segmentation; the human translators added this information only sparingly. From these results, we conclude that the partial segmentation provided by the human translators is insufficient to provide useful gains for SMT, at least using the methodology we adopted in our experiments.

Visual inspection of the absolute values of the BLEU and RIBES metrics would seem to indicate that SMT from Myanmar (an SOV language) to SOV languages can lead to higher quality translations than from Myanmar to SVO and VSO languages. This is intuitive, since the task of re-ordering is considerably simpler in this case.

From the overall results, we can conclude that "Syllable Breaking" and "Syllable + Maximum Matching" achieved higher BLEU and RIBES scores than the other segmentation methods. We expect that these scores can be raised above the current results in the near future. Both the BLEU and RIBES scores demonstrate the relationship between word segmentation and SMT quality.

5.2. Error Analysis

Figure 5 shows two examples of translation output to
illustrate how errors in segmentation have an effect on
translation quality. Figure 5 (a) is an example of an alignment error that occurred in character breaking of the Myanmar sentence "ဂ ပန်လူမ ိြိုျား စ ကရျား ရ ကိို ကက င်ျားကက င်ျား သိပါသလ ျား။" (Are you familiar with Japanese authors?) in my-ja SMT. Here, the "က" character from the word "ကက င်ျားကက င်ျား" is mistakenly mapped to the Japanese word "は". This kind of error occurred in the "Character Breaking" segmentation because the Myanmar character "က" is usually correctly aligned to the frequently occurring Japanese particle "は", and this character occurs several times in the Myanmar word "ကက င်ျားကက င်ျား"; the system preferred to incorrectly translate "က" as the frequent option "は". Figure 5 (b) is an example of a similar alignment error that occurred in "Syllable + Maximum Matching" segmentation of the Myanmar sentence "ဂ င်ဂ ကအျားလ် ကိို ကပျားပါ။" (A ginger ale, please). Here, the Myanmar syllables "ဂ င်" and "ဂ " are mistakenly aligned with the English words "Jim" and "German". This kind of error occurred in the "Syllable Breaking" segmentation because the word contains another Myanmar word and syllable, namely "ဂ င်" and "ဂ ", which can be translated as "Jim" and "German" respectively. These words occur frequently in the corpus relative to the word for "ginger ale". A segmenter capable of identifying the word as a single unit would have avoided this error.

Figure 6 shows the Pearson product-moment correlation coefficient (PMCC) between the BLEU scores of the "Syllable Breaking", "Syllable + Maximum Matching", "Unsupervised (3 gram to 7 gram)" and "Syl, Max Match, Unsupervised (3 gram to 7 gram)" segmentations and the F1 scores for Myanmar to English SMT. We obtained similar PMCC graphs for the other Myanmar-to-target language pairs. From the graphs, there appears to be a moderate level of correlation (e.g. 0.739 for my-en, 0.517 for my-ja, 0.555 for my-ko) between the F-score for segmentation quality and the BLEU score.

Although we can show the relationship between word segmentation and the SMT results, it is still very hard to make an analysis and formulate measures that describe the deep relationship between them. This is due to the complexity of the SMT process; the quality of SMT depends on many factors relating to alignment, re-ordering and so on.

As we mentioned in Section 4.1, the BTEC corpus is also still being created and currently contains many errors such as spelling mistakes, translation errors and problems with the grammar. The SMT evaluation scores presented in this paper therefore represent a lower bound on what is possible with a larger, cleaner corpus.

6. Conclusion

In this paper, we investigated the effectiveness of seven Myanmar word segmentation schemes for SMT. This paper also contributes the first SMT evaluation from Myanmar to the Japanese, Korean, Hindi, Thai, Chinese and Arabic languages. We built character, syllable and word segmentation schemes for Myanmar using rule-based syllable segmentation, maximum matching based word segmentation, "Bayesian Pitman-Yor" language model based unsupervised word segmentation and "pointwise classifier" based supervised word segmentation. In most of our experiments, the "Syllable Breaking" technique achieved the highest SMT evaluation scores in both BLEU and RIBES, but we believe that as more data becomes available both the unsupervised and supervised approaches should improve sufficiently to become useful. We propose an elegant new algorithm for Myanmar syllable breaking that is simple to implement, has high coverage, and is very accurate. We believe it will be easy to adapt to related Asian syllabic languages such as Khmer, Lao, and Nepali. We plan to extend our study of syllable breaking using extensions of the unsupervised and supervised segmentation methods presented in this paper in the near future.

References

[1] Jing Sun and Yves Lepage (2012), "Can Word Segmentation be Considered Harmful for Statistical Machine Translation Tasks between Japanese and Chinese?", In Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, pp. 351-360.
[2] Jia Xu, Richard Zens and Hermann Ney (2004), "Do We Need Chinese Word Segmentation for Statistical Machine Translation?", In Proceedings of the Third SIGHAN Workshop on Chinese Language Processing, pp. 122-128.
[3] Graham Neubig, Taro Watanabe, Shinsuke Mori and Tatsuya Kawahara (2012), "Machine Translation without Words through Substring Alignment", In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 165-174.
[4] Ning Xi, Guangchao Tang, Xinyu Dai, Shujian Huang and Jiajun Chen (2012), "Enhancing Statistical Machine Translation with Character Alignment", In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 285-290.
[5] Jason Naradowsky and Kristina Toutanova (2011), "Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models", In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 895-904.
[6] Zimin Wu and Gwyneth Tseng (1993), "Chinese text segmentation for text retrieval: Achievements and problems", Journal of the American Society for Information Science (JASIS), 44(9), pp. 532-542.
[7] M. Sun, D. Shen and B. K. Tsou (1998), "Chinese word segmentation without using lexicon and hand-crafted training data", In Proceedings of COLING-ACL 98, pp. 1265-1271.
[8] Constantine P. Papageorgiou (1994), "Japanese Word Segmentation by Hidden Markov Model", In Proceedings of a workshop held at Plainsboro, New Jersey, March 8-11, 1994, pp. 283-288.
[9] Daichi Mochihashi, Takeshi Yamada and Naonori Ueda (2009), "Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling", In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Volume 1, pp. 100-108.
[10] Chang Jyun-Shen, C.-D. Chen and Shun-De Chen, "Chinese Word Segmentation through constraint satisfaction and statistical optimization", In Proceedings of ROCLING IV, pp. 147-165.
[11] W. J. Teahan, Yingying Wen, Rodger McNab and Ian Witten (2000), "A compression-based algorithm for Chinese word segmentation", Computational Linguistics, 26(3), pp. 375-393.
[12] Tun Thura Thet, Jin-Cheon Na and Wunna Ko Ko (2008), "Word Segmentation for the Myanmar Language", Journal of Information Science, 34(5), pp. 688-704.
[13] Hla Hla Htay and Kavi Narayana Murthy (2008), "Myanmar Word Segmentation Using Syllable Level Longest Matching", In Proceedings of the 6th Workshop on Asian Language Resources, pp. 41-48.
[14] Zin Maung Maung and Yoshiki Mikami (2008), "A Rule-based Syllable Segmentation of Myanmar Text", In IJCNLP-08 Workshop on NLP for Less Privileged Languages, pp. 51-58.
[15] Pi-Chuan Chang, Michel Galley and Christopher D. Manning (2008), "Optimizing Chinese word segmentation for machine translation performance", In Proceedings of the Third Workshop on Statistical Machine Translation, pp. 224-232.
[16] David Vilar, Jan-Thorsten Peter and Hermann Ney (2007), "Can we translate letters?", In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, pp. 33-39.
[17] Preslav Nakov and Jörg Tiedemann (2012), "Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages", In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 301-305.
[18] Yuan Liu, Qiang Tan and Kun Xu Shen (1994), "The Word Segmentation Methods for Chinese Information Processing" (in Chinese), Qing Hua University Press and Guang Xi Science and Technology Press, p. 36.
[19] Pak-kwong Wong and Chorkin Chan (1996), "Chinese Word Segmentation based on Maximum Matching and Word Binding Force", In Proceedings of the 16th Conference on Computational Linguistics, Volume 1, pp. 200-203.
[20] Department of the Myanmar Language Commission (1993), Myanmar-English Dictionary, Yangon, Ministry of Education.
[21] Graham Neubig, Masato Mimura, Shinsuke Mori and Tatsuya Kawahara (2010), "Learning a Language Model from Continuous Speech", In Proceedings of InterSpeech 2010, pp. 1053-1056.
[22] Graham Neubig, Yosuke Nakata and Shinsuke Mori (2011), "Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis", In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), Short Papers, Volume 2, pp. 529-533.
[23] Genichiro Kikui, Seiichi Yamamoto, Toshiyuki Takezawa and Eiichiro Sumita (2006), "Comparative study on corpora for speech translation", IEEE Transactions on Audio, Speech, and Language Processing, 14(5), pp. 1674-1682.
[24] Chinese Word Segmentation Evaluation Toolkit, http://projectile.sv.cmu.edu/research/public/tools/segmentation/eval/index.htm
[25] Franz Och and Hermann Ney (2000), "Improved Statistical Alignment Models", In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 440-447.
[26] Marcello Federico and Mauro Cettolo (2007), "Efficient Handling of N-gram Language Models for Statistical Machine Translation", In Proceedings of the Second Workshop on Statistical Machine Translation, pp. 88-95.
[27] MOSES (2007), "A Factored Phrase-based Beam-search Decoder for Machine Translation", http://www.statmt.org/moses/.
[28] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu (2002), "BLEU: a Method for Automatic Evaluation of Machine Translation", In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, USA, pp. 311-318.
[29] Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh and Hajime Tsukada (2010), "Automatic Evaluation of Translation Quality for Distant Language Pairs", In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 944-952.
Table 2: Language Resources of Japanese (ja), Korean (ko), Hindi (Romanized), Hindi (Devanagari), English (en), Thai (th),
Chinese (zh) and Arabic (ar)
Table 3: Number of words, precision, recall and F-1 scores of the segmentation methods, calculated on 1,000 manually segmented sentences
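The precision, recall and F values in Table 3 follow the separator-based definitions given in Section 4.4. A minimal sketch of a boundary-based version of that computation, under the simplifying assumption that matched separators are those at identical character offsets (the example strings are hypothetical):

```python
def boundary_positions(segmented):
    """Character offsets of word separators in a whitespace-segmented string."""
    positions, offset = set(), 0
    for token in segmented.split():
        offset += len(token)
        positions.add(offset)
    positions.discard(offset)  # drop the trailing sentence-final boundary
    return positions

def segmentation_f1(hypothesis, reference):
    """Precision over Hyp separators, recall over Ref separators, harmonic mean F."""
    hyp, ref = boundary_positions(hypothesis), boundary_positions(reference)
    if not hyp or not ref:
        return 0.0
    correct = len(hyp & ref)
    precision = correct / len(hyp)
    recall = correct / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(segmentation_f1("ab c d", "ab cd"))  # ≈ 0.667
```

This mirrors why "Syllable Breaking" attains 100% recall in Table 3: every true word boundary is also a syllable boundary, so all reference separators are recovered even though precision suffers.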
Figure 3: Example segmentations of the same Myanmar sentence from the development data under the Human Translator, Character Breaking, Syllable Breaking, Unsupervised (3-gram) and Unsupervised (7-gram) schemes. (The Myanmar script examples could not be reliably reproduced here.)
Table 4: BLEU scores for Human Translator, Character Breaking, Syllable Breaking and Syllable + Maximum Matching
segmentation
Table 8: RIBES scores for Unsupervised (3 to 7 gram), Syl, Max Match, Unsupervised (3 to 7 gram) segmentation
Figure 6. The correlation between BLEU and segmentation F-score for my-en