A Study of Myanmar Word Segmentation Schemes For Statistical Machine Translation
4. SMT Experiments
Table 1: Number of tokens (Train, Development, Test) and average syllables per token for each segmentation scheme

Segmentation Method                      Train       Development   Test      Average Syllable per Token
Human Translator                           151,829       24,273      2,267   5.45
Character Breaking                       2,301,184      339,545     33,449   0.36
Syllable Breaking                          835,030      123,961     12,654   1.00
Syllable + Maximum Matching                718,874      103,447     10,206   1.17
Unsupervised (3 gram)                      565,304       81,536      8,299   1.48
Unsupervised (4 gram)                      577,159       83,855    108,893   1.26
Unsupervised (5 gram)                      575,428       84,530      8,564   1.45
Unsupervised (6 gram)                      567,322       83,328      8,431   1.47
Unsupervised (7 gram)                      573,244       84,965      8,511   1.46
Syl, Max Match, Unsupervised (3 gram)      526,203       75,082      7,495   1.60
Syl, Max Match, Unsupervised (4 gram)      527,216       75,536      7,464   1.59
Syl, Max Match, Unsupervised (5 gram)      526,794       76,010      7,483   1.59
Syl, Max Match, Unsupervised (6 gram)      526,742       75,814      7,568   1.59
Syl, Max Match, Unsupervised (7 gram)      526,803       75,982      7,595   1.59
Semi-Supervised (100 sentences)            527,052       79,955      7,943   1.58
Semi-Supervised (200 sentences)            541,722       81,210      8,041   1.54
Semi-Supervised (300 sentences)            551,389       81,457      8,017   1.52
Semi-Supervised (400 sentences)            546,530       80,352      7,964   1.53
Semi-Supervised (500 sentences)            560,899       82,054      8,114   1.49
Semi-Supervised (600 sentences)            568,567       83,238      8,200   1.47
Semi-Supervised (700 sentences)            554,313       80,406      8,054   1.51
Semi-Supervised (800 sentences)            551,787       80,713      7,992   1.52
Semi-Supervised (900 sentences)            550,423       79,865      7,924   1.52
Semi-Supervised (1000 sentences)           551,327       80,208      7,953   1.52
Semi-Supervised (1100 sentences)           509,566       80,416      7,996   1.62
Semi-Supervised (1200 sentences)           515,162       80,534      8,118   1.61
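The final column of Table 1 above is simply the total syllable count divided by the total token count under each scheme; under "Syllable Breaking" the ratio is 1.00 by construction. A minimal sketch of this statistic, assuming a whitespace-tokenised corpus and a hypothetical `count_syllables` helper standing in for the rule-based Myanmar syllable breaker:

```python
def average_syllables_per_token(sentences, count_syllables):
    """Average number of syllables per token over a segmented corpus.

    `sentences` is an iterable of whitespace-segmented strings;
    `count_syllables` is a (hypothetical) function mapping a token to
    its syllable count, e.g. a rule-based Myanmar syllable breaker.
    """
    total_tokens = 0
    total_syllables = 0
    for sentence in sentences:
        for token in sentence.split():
            total_tokens += 1
            total_syllables += count_syllables(token)
    return total_syllables / total_tokens

# Toy check: if every token is exactly one syllable, the average is 1.0,
# as for the "Syllable Breaking" row above.
print(average_syllables_per_token(["a b c", "d e"], lambda tok: 1))  # prints 1.0
```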
Table 1 shows the number of tokens and the average syllables per token resulting from each of the word segmentation schemes. Here, 3-gram, 4-gram, 5-gram, 6-gram and 7-gram specify the n-gram order of the language model and spelling model used in the unsupervised model.

4.2. Word Segmentation Methods

In the SMT experiments from Myanmar to other languages, we compare the following segmentation methods:

Translation with human translators' segmentation: The BTEC corpus for Myanmar contains some word segmentation added by human translators during translation. These word boundaries were added naturally by the annotators while creating the corpus, and due to the nature of the language they are quite sparse.

Translation with character breaking: Each Myanmar character is interpreted as a single word.

Syllable Breaking: Each Myanmar syllable is interpreted as a single word.

Syllable Breaking + Maximum Matching: First, syllable breaking was done; then Maximum Matching word segmentation was done.

Unsupervised Word Segmentation: First, syllable breaking was done; then the syllable-segmented corpus was segmented using latticelm (with 3-gram to 7-gram language models depending on the experiment).

Syllable Breaking, Maximum Matching and Unsupervised Word Segmentation: First, syllable breaking was done. Second, Maximum Matching word segmentation was done on the syllable-segmented corpus. Finally, the resulting corpus was segmented using latticelm (with 3-gram to 7-gram language models depending on the experiment).

Supervised Word Segmentation: First, manual segmentation was done; then the syllable-segmented corpus was segmented with KyTea. Twelve experiments were performed in total using different amounts of manually segmented data to train KyTea (ranging from 100 to 1,200 sentences).

We calculated the F-score [24] for each segmentation method based on 1,000 manually segmented sentences of Myanmar (see Table 3). We used the Edit Distance of the Word Separator (EDWS) and defined the segmentation precision, recall and harmonic mean F as follows:

Precision = (no. of Sub) / (no. of separators in Hyp)
Recall = (no. of Sub) / (no. of separators in Ref)
F = 2 * Precision * Recall / (Precision + Recall)

Here, Sub = substitutions, Hyp = hypothesis and Ref = reference.

Clearly, "Character Breaking" gives the maximum number of words: 32,748 words, and "Syllable Breaking" gives the second highest number of words: 12,545 words. Syllable breaking followed by the Maximum Matching method gives 10,202 words, and "Human Translator" gives the lowest number of words: 1,985 words. "Syllable + Maximum Matching" gives the highest F1 score. The lowest F1 score of 0.23 was given by the Human Translator's segmentation. This is because human translators rarely put spaces between words, especially in short sentences. 100% recall can be achieved by Syllable Breaking and Character Breaking. The supervised method was trained on the test data for this experiment, and thus we do not show the F1 results for the supervised methods in Table 3.

The same Myanmar sentence will be segmented in radically different ways depending on the segmentation method. Figure 3 shows some of the different segmentations of a Myanmar sentence sampled from the development data. Figure 4 shows an example of word alignment for a Myanmar (Syllable Breaking) and English (word breaking) sentence pair.

4.3. Phrase-based Statistical Machine Translation

The Myanmar source segmented by each of the segmentation methods described in Section 4.2 is aligned to the word-segmented target languages (Japanese, Korean, Hindi (Romanized), Hindi (Devanagari), English, Thai, Chinese and Arabic) using GIZA++ [25]. Language modeling is done using IRSTLM version 5.80.01 [26]. Minimum error rate training (MERT) was used to tune the decoder's parameters, and decoding is done using the phrase-based SMT system MOSES version 0.91 [27].

4.4. Evaluation Criteria

We used two automatic criteria for the evaluation of the SMT. One is the de facto standard automatic
evaluation metric Bilingual Evaluation Understudy
(BLEU) [28] and the other is the Rank-based Intuitive Bilingual Evaluation Measure (RIBES) [29]. The BLEU score measures the precision of 1-grams to 4-grams with respect to a reference translation, with a penalty for short sentences [28]. The BLEU score approximately measures the adequacy of SMT, and larger BLEU scores are better. RIBES is an automatic evaluation metric based on rank correlation coefficients modified with precision, and special care is paid to the word order of the translation results. RIBES is suitable for distant language pairs such as Myanmar and English [29]. Larger RIBES scores are better. We calculated the Pearson product-moment correlation coefficient (PMCC) between BLEU and F1, and between RIBES and F1, to assess the strength of the linear relationship between the segmentation schemes and the quality of SMT.

5. Results

5.1. Discussion

We divided the experiments into three groups (rule-based, unsupervised and supervised) and divided the target languages into three groups (SOV, SVO and VSO) for the comparison. We highlighted the table cells with the maximum BLEU and RIBES scores for each target language in Tables 4 to 9.

The results show that "Syllable Breaking" segmentation consistently gives the best BLEU and RIBES scores for all language pairs. The reason might be that with syllable segmentation very few errors are made. As mentioned in Sections 2.2 and 3.1, the syllables themselves can be delimited with close to 100% accuracy, and it is in principle possible to group these syllables to form the words of Myanmar without error. Increasing the granularity of the segmentation above this level can introduce errors in which the sequences of syllables do not constitute a word. For example, erroneously segmented "words" may contain syllables from more than one true word in the language.

The "Syllable + Maximum Matching" segmentation method also consistently gives rise to high BLEU and RIBES scores. As we mentioned in Section 3.2, Maximum Matching uses a dictionary for left-to-right word segmentation over segmented syllables. Although this segmentation method can make incorrect decisions during segmentation, we believe that its error rate is low relative to the data-driven methods. This is supported by the precision and recall figures shown in Table 3.

When we analyze the "Unsupervised" segmentation method, the results were quite inconsistent. Generally, the highest BLEU scores occur in the 4-gram to 6-gram range and the highest RIBES scores range from 3-gram to 7-gram. A possible explanation is that the segmentation quality has a high degree of variance, and depends on the training parameters or initial conditions. Nonetheless, the second highest BLEU score (my-ja, 33.45) was achieved by the "Unsupervised (6 gram)" segmentation model and the second highest RIBES score (my-ja, 0.819) is given by the "Unsupervised (4 gram)" segmentation model. These are encouraging results, and we believe that this method may have the potential to achieve respectable levels of performance given sufficient data.

When we analyze the SMT quality of the supervised approach, as might be expected, we see a strong dependence on the quantity of data used to train the segmentation model. There is some variance in the results, but the better results in terms of both BLEU and RIBES scores occur in the 500-1200 sentence range. The manually segmented corpus was quite small, and although this approach was unable to match the best performing methods, it came reasonably close using all the data, and was still improving at this point. We therefore expect that pursuing supervised segmentation could lead to a viable method of segmentation for low-resource languages if more manually segmented data were available.

The lowest BLEU and RIBES scores were given by the "Human Translator" segmentation. The reason for this is that most of the translated sentences of the BTEC corpus have no segmentation; the human translators added this information only sparingly. From these results, we conclude that the partial segmentation provided by the human translators is insufficient to provide useful gains for SMT, at least using the methodology we adopted in our experiments.

Visual inspection of the absolute values of the BLEU and RIBES metrics would seem to indicate that SMT from Myanmar (an SOV language) to SOV languages can lead to higher quality translations than from Myanmar to SVO and VSO languages. This is intuitive, since the task of re-ordering is considerably simpler in this case.

From the overall results, we can conclude that "Syllable Breaking" and "Syllable + Maximum Matching" achieved higher BLEU and RIBES scores than the other segmentation methods. We expect that these scores can be raised above the current results in the near future. Both the BLEU and RIBES scores demonstrate the relationship between word segmentation and SMT quality.

5.2. Error Analysis

Figure 5 shows two examples of translation output to
illustrate how errors in segmentation have an effect on
translation quality. Figure 5 (a) is an example of an alignment error that occurred in character breaking of the Myanmar sentence "ဂ ပန်လူမ ိြိုျား စ ကရျား ရ ကိို ကက င်ျားကက င်ျား သိပါသလ ျား။" (Are you familiar with Japanese authors?) in my-ja SMT. Here, the "က" character from the word "ကက င်ျားကက င်ျား" is mistakenly mapped to the Japanese word "は". This kind of error occurred in the "Character Breaking" segmentation because the Myanmar character "က" is usually correctly aligned to the frequently occurring Japanese particle "は", and this character occurs several times in the Myanmar word "ကက င်ျားကက င်ျား"; the system preferred to incorrectly translate "က" as the frequent option "は". Figure 5 (b) is an example of a similar alignment error that occurred in "Syllable + Maximum Matching" segmentation of the Myanmar sentence "ဂ င်ဂ ကအျားလ် ကိို ကပျားပါ။" (A ginger ale, please). Here, the Myanmar syllables "ဂ င်" and "ဂ " are mistakenly aligned with the English words "Jim" and "German". This kind of error occurred in the "Syllable Breaking" segmentation because the word contains another Myanmar word and syllable, namely "ဂ င်" and "ဂ ", which can be translated as "Jim" and "German" respectively. These words occur frequently in the corpus relative to the word for "ginger ale". A segmenter capable of identifying the word as a single unit would have avoided this error.

Figure 6 shows the Pearson product-moment correlation coefficient (PMCC) between the BLEU scores of the "Syllable Breaking", "Syllable + Maximum Matching", "Unsupervised (3 gram to 7 gram)" and "Syl, Max Match, Unsupervised (3 gram to 7 gram)" segmentations and the F1 scores for Myanmar to English SMT. We obtained similar PMCC graphs for the other Myanmar-to-target language pairs. From the graphs, there appears to be a moderate level of correlation (e.g. 0.739 for my-en, 0.517 for my-ja, 0.555 for my-ko) between the F-score for segmentation quality and the BLEU score.

Although we can show the relationship between word segmentation and the SMT results, it is still very hard to make an analysis and formulate measures that describe the deep relationship between them. This is due to the complexity of the SMT process; the quality of SMT depends on many factors relating to alignment, re-ordering and so on.

As we mentioned in Section 4.1, the BTEC corpus is also still being created and currently contains many errors such as spelling mistakes, translation errors and problems with the grammar. The SMT evaluation scores presented in this paper therefore represent a lower bound on what is possible with a larger, cleaner corpus.

6. Conclusion

In this paper, we investigated the effectiveness of seven Myanmar word segmentation schemes for SMT. This paper also contributes the first SMT evaluation from Myanmar to the Japanese, Korean, Hindi, Thai, Chinese and Arabic languages. We built character, syllable and word segmentation schemes for Myanmar using rule-based syllable segmentation, maximum matching based word segmentation, "Bayesian Pitman-Yor" language model based unsupervised word segmentation and "pointwise classifier" based supervised word segmentation. In most of our experiments, the "Syllable Breaking" technique achieved the highest SMT evaluation scores in both BLEU and RIBES, but we believe that as more data becomes available both the unsupervised and supervised approaches should improve sufficiently to become useful. We propose an elegant new algorithm for Myanmar syllable breaking that is simple to implement, has high coverage, and is very accurate. We believe it will be easy to adapt to related Asian syllabic languages such as Khmer, Lao, and Nepali. We plan to extend our study of syllable breaking using extensions of the unsupervised and supervised segmentation methods presented in this paper in the near future.

References

[1] Jing Sun and Yves Lepage (2012), "Can Word Segmentation be Considered Harmful for Statistical Machine Translation Tasks between Japanese and Chinese?", In Proceedings of the 26th Pacific Asia Conference on Language, Information and Computation, pp. 351-360.
[2] Jia Xu, Richard Zens and Hermann Ney (2004), "Do We Need Chinese Word Segmentation for Statistical Machine Translation?", In Proceedings of the Third SIGHAN Workshop on Chinese Language Processing, pp. 122-128.
[3] Graham Neubig, Taro Watanabe, Shinsuke Mori and Tatsuya Kawahara (2012), "Machine Translation without Words through Substring Alignment", In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 165-174.
[4] Ning Xi, Guangchao Tang, Xinyu Dai, Shujian Huang and Jiajun Chen (2012), "Enhancing Statistical Machine Translation with Character Alignment", In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pp. 285-290.
[5] Jason Naradowsky and Kristina Toutanova (2011), "Unsupervised Bilingual Morpheme Segmentation and Alignment with Context-rich Hidden Semi-Markov Models", In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, pp. 895-904.
[6] Zimin Wu and Gwyneth Tseng (1993), "Chinese text segmentation for text retrieval: Achievements and problems", Journal of the American Society for Information Science (JASIS), 44(9), pp. 532-542.
[7] M. Sun, D. Shen and B. K. Tsou (1998), "Chinese word segmentation without using lexicon and hand-crafted training data", In Proceedings of COLING-ACL 98, pp. 1265-1271.
[8] Constantine P. Papageorgiou (1994), "Japanese Word Segmentation by Hidden Markov Model", In Proceedings of a workshop held at Plainsboro, New Jersey, March 8-11, 1994, pp. 283-288.
[9] Daichi Mochihashi, Takeshi Yamada and Naonori Ueda (2009), "Bayesian Unsupervised Word Segmentation with Nested Pitman-Yor Language Modeling", In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Volume 1, pp. 100-108.
[10] Chang Jyun-Shen, C.-D. Chen and Shun-De Chen, "Chinese Word Segmentation through constraint satisfaction and statistical optimization", In Proceedings of ROCLING IV, pp. 147-165.
[11] W. J. Teahan, Yingying Wen, Rodger McNab and Ian Witten (2000), "A compression-based algorithm for Chinese word segmentation", Computational Linguistics, 26(3), pp. 375-393.
[12] Tun Thura Thet, Jin-Cheon Na and Wunna Ko Ko (2008), "Word Segmentation for the Myanmar Language", Journal of Information Science, 34(5), pp. 688-704.
[13] Hla Hla Htay and Kavi Narayana Murthy (2008), "Myanmar Word Segmentation Using Syllable Level Longest Matching", In Proceedings of the 6th Workshop on Asian Language Resources, pp. 41-48.
[14] Zin Maung Maung and Yoshiki Mikami (2008), "A Rule-based Syllable Segmentation of Myanmar Text", In IJCNLP-08 Workshop on NLP for Less Privileged Languages, pp. 51-58.
[15] Pi-Chuan Chang, Michel Galley and Christopher D. Manning (2008), "Optimizing Chinese word segmentation for machine translation performance", In Proceedings of the Third Workshop on Statistical Machine Translation, pp. 224-232.
[16] David Vilar, Jan-Thorsten Peter and Hermann Ney (2007), "Can we translate letters?", In Proceedings of the Second Workshop on Statistical Machine Translation, Prague, Czech Republic, pp. 33-39.
[17] Preslav Nakov and Jörg Tiedemann (2012), "Combining Word-Level and Character-Level Models for Machine Translation Between Closely-Related Languages", In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 301-305.
[18] Yuan Liu, Qiang Tan and Kun Xu Shen (1994), "The Word Segmentation Methods for Chinese Information Processing" (in Chinese), Qing Hua University Press and Guang Xi Science and Technology Press, p. 36.
[19] Pak-kwong Wong and Chorkin Chan (1996), "Chinese Word Segmentation based on Maximum Matching and Word Binding Force", In Proceedings of the 16th Conference on Computational Linguistics, Volume 1, pp. 200-203.
[20] Department of the Myanmar Language Commission (1993), Myanmar-English Dictionary, Yangon, Ministry of Education.
[21] Graham Neubig, Masato Mimura, Shinsuke Mori and Tatsuya Kawahara (2010), "Learning a Language Model from Continuous Speech", In Proceedings of InterSpeech 2010, pp. 1053-1056.
[22] Graham Neubig, Yosuke Nakata and Shinsuke Mori (2011), "Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis", In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT), Short Papers, Volume 2, pp. 529-533.
[23] Genichiro Kikui, Seiichi Yamamoto, Toshiyuki Takezawa and Eiichiro Sumita (2006), "Comparative study on corpora for speech translation", IEEE Transactions on Audio, Speech, and Language Processing, 14(5), pp. 1674-1682.
[24] Chinese Word Segmentation Evaluation Toolkit, http://projectile.sv.cmu.edu/research/public/tools/segmentation/eval/index.htm
[25] Franz Och and Hermann Ney (2000), "Improved Statistical Alignment Models", In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 440-447.
[26] Marcello Federico and Mauro Cettolo (2007), "Efficient Handling of N-gram Language Models for Statistical Machine Translation", In Proceedings of the Second Workshop on Statistical Machine Translation, pp. 88-95.
[27] MOSES (2007), "A Factored Phrase-based Beam-search Decoder for Machine Translation", http://www.statmt.org/moses/.
[28] Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu (2002), "BLEU: a Method for Automatic Evaluation of Machine Translation", In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, USA, pp. 311-318.
[29] Hideki Isozaki, Tsutomu Hirao, Kevin Duh, Katsuhito Sudoh and Hajime Tsukada (2010), "Automatic Evaluation of Translation Quality for Distant Language Pairs", In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 944-952.
Table 2: Language Resources of Japanese (ja), Korean (ko), Hindi (Romanized), Hindi (Devanagari), English (en), Thai (th),
Chinese (zh) and Arabic (ar)
Table 3: Number of words, precision, recall and F-1 scores of the segmentation methods, calculated on 1,000 manually segmented sentences
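The precision, recall and F values in Table 3 follow the separator-based definitions given in Section 4.4. A minimal sketch of a boundary-based version of that computation, under the simplifying assumption that matched separators are those at identical character offsets (the example strings are hypothetical):

```python
def boundary_positions(segmented):
    """Character offsets of word separators in a whitespace-segmented string."""
    positions, offset = set(), 0
    for token in segmented.split():
        offset += len(token)
        positions.add(offset)
    positions.discard(offset)  # drop the trailing sentence-final boundary
    return positions

def segmentation_f1(hypothesis, reference):
    """Precision over Hyp separators, recall over Ref separators, harmonic mean F."""
    hyp, ref = boundary_positions(hypothesis), boundary_positions(reference)
    if not hyp or not ref:
        return 0.0
    correct = len(hyp & ref)
    precision = correct / len(hyp)
    recall = correct / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(segmentation_f1("ab c d", "ab cd"))  # ≈ 0.667
```

This mirrors why "Syllable Breaking" attains 100% recall in Table 3: every true word boundary is also a syllable boundary, so all reference separators are recovered even though precision suffers.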
Figure 3: Example segmentations of the same Myanmar sentence from the development data under the Human Translator, Character Breaking, Syllable Breaking, Unsupervised (3-gram) and Unsupervised (7-gram) schemes. (The Myanmar script examples could not be reliably reproduced here.)
Table 4: BLEU scores for Human Translator, Character Breaking, Syllable Breaking and Syllable + Maximum Matching
segmentation
Table 8: RIBES scores for Unsupervised (3 to 7 gram), Syl, Max Match, Unsupervised (3 to 7 gram) segmentation
Figure 6. The correlation between BLEU and segmentation F-score for my-en