Myanmar NLP research and
Usefulness of ALT data
Dr. Khin Mar Soe
Professor
NLP Lab, UCSY
26-11-2015
Contents
Introduction to UCSY
Introduction to UCSY NLP Lab
Current Myanmar NLP Research
Usefulness of ALT Data
Conclusion
http://www.ucsy.edu.mm
Natural Language Processing Lab in UCSY
Started in 2006 at the University of Computer Studies, Yangon
(UCSY), under the Ministry of Science and Technology.
Some of the NLP Lab's work is available online:
◦ Network-based ASEAN Languages Translation Public Service
(http://www.aseanmt.org)
◦ English to Myanmar Statistical Machine Translation System
(http://www.nlpresearch-ucsy.edu.mm/NLP_UCSY/mtapplication.html)
◦ Myanmar-English-Myanmar bilingual dictionary
(http://www.nlpresearch-ucsy.edu.mm/NLP_UCSY/dictionaryapplication.html)
◦ Myanmar Word Segmentation
(http://www.nlpresearch-ucsy.edu.mm/NLP_UCSY/wsandpos.html)
Research Collaboration
NECTEC (National Electronics and Computer Technology
Center, Thailand)
NICT (National Institute of Information and
Communications Technology, Japan)
For the purpose of
◦ joint research projects,
◦ researcher exchange,
◦ publishing conference papers, journal papers and articles,
◦ holding joint NLP workshops.
NLP Lab
NLP Lab Members
NLP Research
Aim of Research
◦ to overcome the language barrier
◦ to be conveniently applicable in systems used by Myanmar
users
Domain of Research
◦ Myanmar-English-Myanmar Machine Translation
◦ Automatic Speech Recognition
◦ Text to Speech
◦ Myanmar Information Retrieval
◦ Myanmar Named Entity Recognition and Transliteration
◦ Myanmar Text Summarization
◦ Myanmar Text Categorization
Overview of the System
[Figure: system architecture. Components shown: Source Language
Analysis, Alignment, Dictionary, Translation, Word Sense
Disambiguation, Target Language Generation.]
Source Language Analysis
For the Myanmar-English translation phase, this is the work
of the Myanmar language analyzer:
◦ Myanmar Part-of-Speech (POS) tagging and chunking
◦ Syntactic analysis: function tagging and building
grammatical relations
For the English-Myanmar translation phase:
◦ English POS tagging and chunking
◦ Syntactic analysis: function tagging and building
grammatical relations
Myanmar POS Tagging and Chunking
[Figure: processing pipeline. Stages: Word Identification and
Basic POS Tagging, Basic POS Tag Disambiguation, Normalization,
Chunking. Resources shown: Myanmar Lexicon, POS Tagged Corpus,
Normalization Rules, Chunk Rules.]
Pre-tagged Corpus Format:
Training Corpus
◦ Myanmar words are segmented and tagged with their respective
basic POS tags and categories as follows:
သူ/PRN.Person # ေက်ာင္း/NN.Building # သို႔/PPM.Direction #
သြား/VB.Common # သည္/SF.Declarative
ေက်ာင္းသား/NN.Person # မ်ား/Part.Number # ထဲတြင္/PPM.Extract
# သူ/PRN.Person # အ/Part.Common # ေတာ္/JJ.Dem #
ဆုံး/Part.Common # ျဖစ္/VB.Common # သည္/SF.Declarative
ဤ/PRN.Distobj # စာ/NN.Common # ကုိ/PPM.Obj #
မည္သူ/PRN.Question # ေရး/VB.Common # ခဲ႔/Part.Support #
သနည္း/SF.Interrogative
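The corpus format above can be read with a few lines of Python. This is a minimal sketch, assuming tokens are separated by `#` and each token has the shape `word/POS.Category`; the function name `parse_tagged_sentence` is hypothetical and not part of the lab's tools.

```python
from typing import List, Tuple


def parse_tagged_sentence(line: str) -> List[Tuple[str, str, str]]:
    """Split 'word/POS.Category # word/POS.Category ...' into triples."""
    tokens = []
    for chunk in line.split("#"):
        chunk = chunk.strip()
        if not chunk:
            continue
        # The tag follows the last slash; the category follows the first dot.
        word, _, tag = chunk.rpartition("/")
        pos, _, category = tag.strip().partition(".")
        tokens.append((word.strip(), pos, category))
    return tokens


# Part of the first example sentence from the slide.
sentence = "သူ/PRN.Person # ေက်ာင္း/NN.Building # သြား/VB.Common # သည္/SF.Declarative"
for word, pos, category in parse_tagged_sentence(sentence):
    print(word, pos, category)
```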
Example : Tagging
Input Text
သံလြင္ ျမစ္ သည္ ျမန္မာျပည္ ေတာင္ပုိင္း သုိ႔ ဦးတည္ စီးဆင္း သြား သည္။
(The Than Lwin River flows toward the south of Myanmar.)
Tagging with All Possible Tags on Each Word
သံလြင္_#NNP.Location
ျမစ္ _#NN.Location
သည္ _#SF.Declarative #PPM.Subj
ျမန္မာျပည္ _#NNP.Location
ေတာင္ပုိင္း_#NN.Location
သု႔ိ _#PPM.Direction
ဦးတည္_#VB.Common
စီးဆင္း _#VB.Common
သြား_#VB.Common#NN.Body#Part.Support
Disambiguation of Tags
• Disambiguate all possible basic POS tags to produce the
correct tag for each word.
• Train HMM and LHMM models on the Myanmar pre-tagged corpus.
• Decode with the Viterbi algorithm to find the most probable
path (best tag sequence) for a given word sequence.
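Viterbi decoding over an HMM can be sketched as follows. This is an illustration only: the probabilities are invented toy values, not estimates from the UCSY corpus, the tag set is cut down to four coarse tags, and the ASCII tokens `river`, `thi` and `flow` are placeholders for Myanmar words (`thi` stands in for the particle that is ambiguous between PPM.Subj and SF.Declarative).

```python
def viterbi(words, states, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under an HMM."""
    # best[t][s] = (probability of the best path ending in s at t, previous tag)
    best = [{s: (start_p.get(s, 0.0) * emit_p[s].get(words[0], 0.0), None)
             for s in states}]
    for t in range(1, len(words)):
        column = {}
        for s in states:
            prob, prev = max(
                (best[t - 1][p][0] * trans_p[p].get(s, 0.0), p) for p in states
            )
            column[s] = (prob * emit_p[s].get(words[t], 0.0), prev)
        best.append(column)
    # Backtrack from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for t in range(len(words) - 1, 0, -1):
        last = best[t][last][1]
        path.append(last)
    return path[::-1]


# Toy model (all numbers are assumptions for illustration).
states = ["NN", "PPM", "VB", "SF"]
start_p = {"NN": 0.7, "PPM": 0.1, "VB": 0.1, "SF": 0.1}
trans_p = {
    "NN": {"NN": 0.1, "PPM": 0.6, "VB": 0.2, "SF": 0.1},
    "PPM": {"NN": 0.4, "PPM": 0.05, "VB": 0.5, "SF": 0.05},
    "VB": {"NN": 0.05, "PPM": 0.05, "VB": 0.2, "SF": 0.7},
    "SF": {"NN": 0.25, "PPM": 0.25, "VB": 0.25, "SF": 0.25},
}
emit_p = {
    "NN": {"river": 1.0},
    "PPM": {"thi": 0.5, "tho": 0.5},
    "VB": {"flow": 1.0},
    "SF": {"thi": 1.0},
}
words = ["river", "thi", "flow", "thi"]
print(viterbi(words, states, start_p, trans_p, emit_p))
```

In this toy model the first `thi` is resolved as a subject marker and the second as a sentence-final marker, because the transition probabilities favor a postposition after a noun and a sentence-final particle after a verb. A production tagger would work with log probabilities to avoid underflow on long sentences.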
Example : Disambiguation
Disambiguation: Assigning the Correct Tag to Each Word
သံလြင္_#NNP.Location (Than Lwin)
ျမစ္ _#NN.Location (The river)
သည္ _#PPM.Subj (null)
ျမန္မာျပည္ _#NNP.Location (Myanmar)
ေတာင္ပုိင္း_#NN.Location (south)
သု႔ိ _#PPM.Direction (to)
ဦးတည္_#VB.Common (flows)
စီးဆင္း _#VB.Common (flows)
သြား_#Part.Support (flows)
သည္ _#SF.Declarative (null)
Example : Normalization
• forming more meaningful words and annotating with
appropriate POS tags and categories.
Before normalization,
"က်န္းမာ/VB.Common # ျခင္း/Part.Common # သည္ /PPM.Subj #
လာဘ္/NN.Common # တစ္/NN.Cardinal # ပါး/Part.Type #
ျဖစ္/VB.Common # သည္ /SF.Declarative"
After normalization,
"က်န္းမာျခင္း/NN.VBConvert # သည္/PPM.Subj # လာဘ္/NN.Common #
တစ္/NN.Cardinal # ပါး/Part.Type # ျဖစ္/VB.Common #
သည္/SF.Declarative"
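The normalization example above merges a verb and a following nominalizing particle into one noun. A minimal sketch of one such rule is shown here; the actual rule set is not given on the slide, so the single rule and the function name `normalize` are assumptions.

```python
def normalize(tokens, nominalizer="ျခင္း"):
    """Merge VERB + nominalizing particle into a single NN.VBConvert token."""
    out = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        followed_by_nominalizer = (
            i + 1 < len(tokens) and tokens[i + 1] == (nominalizer, "Part.Common")
        )
        if tag.startswith("VB.") and followed_by_nominalizer:
            out.append((word + nominalizer, "NN.VBConvert"))
            i += 2  # consume both the verb and the particle
        else:
            out.append((word, tag))
            i += 1
    return out


# First three tokens of the slide's example sentence.
before = [("က်န္းမာ", "VB.Common"), ("ျခင္း", "Part.Common"), ("သည္", "PPM.Subj")]
for word, tag in normalize(before):
    print(f"{word}/{tag}")
```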
Example : Chunking
• Assembles the POS-tagged words into chunks and assigns a chunk tag to each.
Before chunking,
သူတုိ႔/NNR.Person # သည္/PPM.Subj # အတန္း/NN.Common #
ထဲတြင္/PPM.Extract # အေတာ္ဆုံး/JJS.Common #
ေက်ာင္းသားမ်ား/NNR.Person # ျဖစ္/VB.Common # ၾက/Part.Support #
သည္/SF.Declarative
After chunking,
NC [သူတုိ႔/NNR.Person] # PPC [သည္/PPM.Subj] # NC
[အတန္း/NN.Common] # PPC [ထဲတြင္/PPM.Extract] # NC
[အေတာ္ဆုံး/JJS.Common # ေက်ာင္းသားမ်ား/NNR.Person] # VC
[ျဖစ္/VB.Common # ၾက/Part.Support] # SFC [သည္/SF.Declarative]
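A rule-based chunker that reproduces the grouping above can be sketched as follows. The slide does not list the chunk rules, so the POS-to-chunk mapping `CHUNK_OF` and the rule that a particle joins the preceding chunk are assumptions inferred from this one example; `w1` to `w9` are placeholder words.

```python
# Assumed mapping from basic POS tag to chunk label.
CHUNK_OF = {
    "NN": "NC", "NNR": "NC", "NNP": "NC", "PRN": "NC",
    "JJ": "NC", "JJS": "NC",
    "PPM": "PPC", "VB": "VC", "SF": "SFC",
}


def chunk(tokens):
    """Group adjacent (word, tag) pairs that share a chunk label."""
    chunks = []
    for word, tag in tokens:
        pos = tag.split(".")[0]
        label = CHUNK_OF.get(pos)
        if pos == "Part" and chunks:
            label = chunks[-1][0]  # a particle attaches to the preceding chunk
        if label is None:
            label = "O"  # outside any known chunk type
        if chunks and chunks[-1][0] == label:
            chunks[-1][1].append((word, tag))
        else:
            chunks.append((label, [(word, tag)]))
    return chunks


tags = ["NNR.Person", "PPM.Subj", "NN.Common", "PPM.Extract", "JJS.Common",
        "NNR.Person", "VB.Common", "Part.Support", "SF.Declarative"]
tokens = [(f"w{i}", tag) for i, tag in enumerate(tags, start=1)]
print([label for label, _ in chunk(tokens)])
```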
Alignment
Identifying corresponding words that are translations of
each other, based on information found in parallel text.
Developing a Myanmar-English bilingual corpus:
◦ Dictionary lookup approach
◦ Corpus-based approach
Word Alignment Algorithm
Step 1: Accept a pair of Myanmar and English sentences.
Step 2: Tag the English sentence with a part-of-speech (POS)
tagger, which produces tagged output together with each
word's root form.
Step 3: Segment the Myanmar sentence into words.
Remove the stop words.
Perform morphological analysis of the noun and verb affixes
using a trigram method.
Step 4: Align the English and Myanmar words output by
Steps 2 and 3, using the first three IBM models and the EM
algorithm on the parallel corpus.
Step 5: Align the remaining (unaligned) words using a Myanmar-
English bilingual dictionary.
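The EM training in Step 4 can be illustrated with IBM Model 1, the simplest of the IBM models. This sketch covers Model 1 only, not Models 2 and 3, and it omits the NULL word. The three sentence pairs are invented, and tokens such as `M_he` are placeholders for segmented Myanmar words.

```python
from collections import defaultdict


def ibm_model1(pairs, iterations=20):
    """Estimate t[(s, e)] = P(source word s | target word e) with EM."""
    src_vocab = {s for src, _ in pairs for s in src}
    t = defaultdict(lambda: 1.0 / len(src_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for src, tgt in pairs:
            for s in src:
                norm = sum(t[(s, e)] for e in tgt)
                for e in tgt:
                    frac = t[(s, e)] / norm
                    count[(s, e)] += frac
                    total[e] += frac
        # M-step: re-estimate the translation probabilities.
        for (s, e), c in count.items():
            t[(s, e)] = c / total[e]
    return t


def align(src, tgt, t):
    """Link each source word to its most probable target word."""
    return [(s, max(tgt, key=lambda e: t[(s, e)])) for s in src]


pairs = [
    (["M_he", "M_go"], ["he", "goes"]),
    (["M_he", "M_eat"], ["he", "eats"]),
    (["M_bird", "M_go"], ["bird", "goes"]),
]
t = ibm_model1(pairs)
print(align(["M_he", "M_eat"], ["he", "eats"], t))
```

Words left unaligned or weakly aligned by such a model are the ones Step 5 would hand to the bilingual dictionary.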
Example Alignment
သူ ေက်ာင္း သိ႕ု ေျခလ်င္ သြားသည္။
He goes to school on foot.
Problems in Alignment
Scarce Resources
No POS-tagged parallel corpus for Myanmar and English is
publicly available.
The constructed POS-tagged corpus is limited in size.
Linguistic Problems
The two sentences of a parallel pair may differ in length.
Myanmar and English word order can differ significantly.
Myanmar is a morphologically rich, verb-final
language. English is a subject-verb-object (SVO) language.
Translation
Phrase/word Translation Pair Extraction
Morphological Analysis
Word Sense Disambiguation
Phrase/word Extraction
Each phrase is identified by its start position, end
position, phrase length and target phrase, to ensure that
there are no gaps and no overlaps.
N-gram methods are applied to the corpus:
Source phrase   Start position   End position   Phrase length   Target phrase   Translation probability
ငွက္             1                1              1               Bird            1.0
ငွက္မ်ား           1                2              2               Birds           1.0
ပ်ံ               4                4              1               Fly             1.0
ပ်ံၾကသည္          4                6              3               Fly             1.0
Translation
ငွက္မ်ား - birds
ပ်ံၾကသည္ - fly
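The slide does not say how the translation probabilities in the table are computed. One common choice, assumed here, is relative frequency over the extracted phrase pairs: P(target | source) = count(source, target) / count(source). The function name and the toy pairs below are hypothetical.

```python
from collections import Counter, defaultdict


def translation_probabilities(phrase_pairs):
    """Relative-frequency estimate of P(target phrase | source phrase)."""
    pair_counts = Counter(phrase_pairs)
    source_counts = Counter(src for src, _ in phrase_pairs)
    table = defaultdict(dict)
    for (src, tgt), c in pair_counts.items():
        table[src][tgt] = c / source_counts[src]
    return table


# Placeholder phrase pairs; SRC_* stand in for Myanmar source phrases.
extracted = [
    ("SRC_birds", "birds"),
    ("SRC_birds", "birds"),
    ("SRC_fly", "fly"),
    ("SRC_fly", "flies"),
]
table = translation_probabilities(extracted)
print(table["SRC_birds"])
print(table["SRC_fly"])
```

A source phrase that is always paired with the same target phrase gets probability 1.0, which matches the values shown in the table above.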
Example : Morphological Analysis of verbs
• Myanmar unknown verb: ၾကည့္ခဲ့ပါသည္
• Main Verb: ၾကည့္
• Verb suffix: ခဲ့ပါသည္
• Tense particle: ခဲ့
• Translation of main verb (using corpus): look
• Generation of surface word: ၾကည့္/look, ခဲ့/past,
ပါသည္/null (suffix)
• ၾကည့္ခဲ့ပါသည္/looked
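The verb analysis above can be sketched as stem matching followed by suffix stripping. This is a simplification: the small lookup tables stand in for the corpus and for the trigram-based affix analysis mentioned earlier, and the names `analyze_verb` and `generate` are hypothetical.

```python
def analyze_verb(word, stems, tense_particles, endings):
    """Split an unknown verb into (stem, English gloss, tense), or return None."""
    for stem, gloss in stems.items():
        if not word.startswith(stem):
            continue
        rest = word[len(stem):]
        tense = "present"
        for particle, label in tense_particles.items():
            if rest.startswith(particle):
                tense = label
                rest = rest[len(particle):]
                break
        if rest in endings:  # the remaining suffix carries no English word
            return stem, gloss, tense
    return None


def generate(gloss, tense, past_forms):
    """Produce the English surface form; irregular past forms come from a table."""
    if tense == "past":
        return past_forms.get(gloss, gloss + "ed")
    return gloss


STEM, PAST, ENDING = "ၾကည့္", "ခဲ့", "ပါသည္"
word = STEM + PAST + ENDING  # the unknown verb from the slide
analysis = analyze_verb(word, {STEM: "look"}, {PAST: "past"}, {ENDING, ""})
if analysis is not None:
    stem, gloss, tense = analysis
    print(generate(gloss, tense, past_forms={"go": "went"}))
```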
Word Sense Disambiguation for Myanmar
Language
Purpose:
◦ to solve the ambiguity of Myanmar words for Myanmar-
English machine translation
Ambiguous Example
Noun Examples
တူ: chopsticks, nephew, hammer
သူသည္တူျဖင့္ေခါက္ဆဲြစားသည္။ He eats noodles with chopsticks.
သူ႔မွာတူသံုးေယာက္ရွိသည္။ He has three nephews.
လက္သမားသည္တူကိုသံုးသည္။ The carpenter uses a hammer.
WSD Algorithm for Myanmar Word
Step 1: Preprocessing
- Segment the input sentence
- Remove stop words from the input sentence and create the ambiguous vector
Step 2: Multi-sense look-up
- Retrieve all possible senses of the ambiguous word from the corpus
- Collect the training data for these senses from the corpus
Step 3: Build a context vector for each sense from the collected
training data
- For each context vector:
  - Remove stop words
  - Remove redundant words
Step 4: Calculate the cosine between the ambiguous vector and each
of the context vectors:
$$\cos(A, B) = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$$
where $A_i$ is the weight of word $i$ in the ambiguous vector and
$B_i$ is the weight of word $i$ in a context vector
Step 5: Choose the correct sense of the target word:
$$s' = \arg\max_{s_i} \mathrm{score}(s_i)$$
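The five steps can be sketched with bag-of-words vectors. This is a minimal illustration, assuming score(s_i) is the cosine between the ambiguous vector and the context vector of sense s_i; the English words stand in for segmented Myanmar context words, and the sense contexts are invented rather than taken from the lab's corpus.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(context_words, sense_contexts, stop_words=frozenset()):
    """Return the sense whose context vector is closest to the input context."""
    ambiguous = Counter(w for w in context_words if w not in stop_words)
    scores = {}
    for sense, words in sense_contexts.items():
        # set() drops redundant words; stop words are filtered out as well.
        vector = Counter(w for w in set(words) if w not in stop_words)
        scores[sense] = cosine(ambiguous, vector)
    return max(scores, key=scores.get), scores


# Invented context words for the three senses of the ambiguous noun.
sense_contexts = {
    "chopsticks": ["noodle", "eat", "rice", "eat"],
    "nephew": ["brother", "son", "family"],
    "hammer": ["carpenter", "nail", "use"],
}
best, scores = disambiguate(["he", "noodle", "eat"], sense_contexts, stop_words={"he"})
print(best, scores)
```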
Conclusion
Data sparseness is one of the most important issues in NLP
research, for the following reasons:
◦ Rules alone cannot solve all problems for
many languages.
◦ So, research is moving toward statistical
models.
◦ The more data is available for developing
systems and tools, the higher the accuracy we can achieve.
So, ALT data is very useful, not only for the Myanmar
language but for all languages, in various
kinds of NLP research.
Thank you!