0% found this document useful (0 votes)
70 views29 pages

Overview of Myanmar-English Machine Translation System - NICT

The document discusses natural language processing (NLP) research being conducted at the NLP Lab at the University of Computer Studies, Yangon (UCSY) in Myanmar. It provides an overview of the lab's history and current research areas, which include machine translation, speech recognition, information retrieval and other tasks. It also describes collaboration with other institutions and the use of aligned language data to help address challenges in NLP for Myanmar.

Uploaded by

Lion Couple
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
70 views29 pages

Overview of Myanmar-English Machine Translation System - NICT

The document discusses natural language processing (NLP) research being conducted at the NLP Lab at the University of Computer Studies, Yangon (UCSY) in Myanmar. It provides an overview of the lab's history and current research areas, which include machine translation, speech recognition, information retrieval and other tasks. It also describes collaboration with other institutions and the use of aligned language data to help address challenges in NLP for Myanmar.

Uploaded by

Lion Couple
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 29

Myanmar NLP research and

Usefulness of ALT data

Dr. Khin Mar Soe


Professor
NLP Lab, UCSY
26-11-2015
Contents

 Introduction to UCSY
 Introduction to UCSY NLP Lab
 Current Myanmar NLP Research
 Usefulness of ALT Data
 Conclusion

2
http://www.ucsy.edu.mm 3
Natural Language Processing Lab in UCSY
 started in 2006 at University of Computer Studies, Yangon
(UCSY) under Ministry of Science and Technology.
 Some of the works of the NLP lab are available online:
◦ Network-based ASEAN Languages Translation Public Service
(http://www.aseanmt.org)
◦ English to Myanmar Statistical Machine Translation System
(http://www.nlpresearch-
ucsy.edu.mm/NLP_UCSY/mtapplication.html)
◦ Myanmar-English-Myanmar bilingual dictionary
(http://www.nlpresearch-
ucsy.edu.mm/NLP_UCSY/dictionaryapplication.html)
◦ Myanmar Word Segmentation
(http://www.nlpresearch-ucsy.edu.mm/NLP_UCSY/wsandpos.html)

4
Research Collaboration
 NECTEC (Thailand National Electronics and Computer
Technology Center)
 NICT (National Institute of Information and
Communication Technology)

 For the purpose of


◦ joint researches/projects,
◦ researcher exchange,
◦ publishing conference papers, journals and articles,
◦ doing joint NLP workshops.

5
NLP Lab

6
NLP Lab Members

7
NLP Research
Aim of Research
 to overcome language barrier
 to be applied conveniently in systems that are used by
Myanmar

 Domain of Research
◦ Myanmar-English-Myanmar Machine Translation
◦ Automatic Speech Recognition
◦ Text to Speech
◦ Myanmar Information Retrieval
◦ Myanmar Name Entity Recognition and Transliteration
◦ Myanmar Text Summarization
◦ Myanmar Text Categorization
8
Overview of the System

Alignment

Source Language
Analysis
Dictionary

Translation

Word Sense
Target Language
Disambiguation
Generation
Source Language Analysis
 For Myanmar-English translation phase, it is the process
of Myanmar Language Analyzer:
◦ Myanmar Part-of-Speech (POS) Tagging and
Chunking of Myanmar Language
◦ Syntactic Analysis
 Function Tagging and making Grammatical relation

• For English-Myanmar translation phase,


• English POS and Chunking
◦ Syntactic Analysis
 Function Tagging and making Grammatical relation
Myanmar POS Tagging and Chunking

Myanmar
Word Identification and Basic POS Tagging
Lexicon

POS Basic POS Tag Disambiguation


Tagged
Corpus

Normalization Normalization
Rules

Chunk Rules Chunking


Pre-tagged Corpus Format :
 Training Corpus

o Myanmar words are segmented and tagged with their respective


basic POS tags and categories as follows ::

 သူ/PRN.Person # ေက်ာင္း/NN.Building # သိ/ု႔ PPM.Direction #


သြား/VB.Common # သည္/SF.Declarative

 ေက်ာင္းသား/NN.Person # မ်ား/Part.Number # ထဲတင


ြ ္/PPM.Extract
# သူ/PRN.Person # အ/Part.Common # ေတာ္/JJ.Dem #
ဆုံး/Part.Common # ျဖစ္/VB.Common # သည္/SF.Declarative

 ဤ/PRN.Distobj # စာ/NN.Common # ကု/ိ PPM.Obj #


မည္သ/ူ PRN.Question # ေရး/VB.Common # ခဲ႔ /Part.Support #
သနည္း/SF.Interrogative
Example : Tagging
 Input Text
 သံလြင္ ျမစ္ သည္ ျမန္မာျပည္ ေတာင္ပုိင္း သုိ႔ ဦးတည္ စီးဆင္း သြား သည္။
(The river, Than Lwin, flows to south of Myanmar.)

 Tagging with All Possible Tags on Each Word

 သံလြင္_#NNP.Location
 ျမစ္ _#NN.Location
 သည္ _#SF.Declarative #PPM.Subj
 ျမန္မာျပည္ _#NNP.Location
 ေတာင္ပုိင္း_#NN.Location
 သု႔ိ _#PPM.Direction
 ဦးတည္_#VB.Common
 စီးဆင္း _#VB.Common
 သြား_#VB.Common#NN.Body#Part.Support
Disambiguation of Tags

• disambiguating all possible basic POS tags to


produce the correct tag.

• training Myanmar pre-tagged Corpus with HMMs


and LHMMs models.

• decoding using the Viterbi tagging algorithm to find


out the best probable path (best tag sequence) for a
given word sequence.
Example : Disambiguation
 Disambiguation and Assigning with Correct Tag on Each Word

 သံလြင္_#NNP.Location (Than Lwin)


 ျမစ္ _#NN.Location (The river)
 သည္ _#PPM.Subj (null)
 ျမန္မာျပည္ _#NNP.Location (Myanmar)
 ေတာင္ပုိင္း_#NN.Location (south)
 သု႔ိ _#PPM.Direction (to)
 ဦးတည္_#VB.Common (flows)
 စီးဆင္း _#VB.Common (flows)
 သြား_#Part.Support (flows)
 သည္ _#SF.Declarative (null)
Example : Normalization

• forming more meaningful words and annotating with


appropriate POS tags and categories.

 Before normalization,
"က်န္းမာ/VB.Common # ျခင္း/Part.Common # သည္ /PPM.Subj #
လာဘ္/NN.Common # တစ္/NN.Cardinal # ပါး/Part.Type #
ျဖစ္/VB.Common # သည္ /SF.Declarative"

 After normalization,
"က်န္းမာျခင္း/NN.VBConvert # သည္ / PPM.Subj # လာဘ္ / NN.Common #
တစ္ / NN.Cardinal # ပါး / Part.Type # ျဖစ္/ VB.Common # သည္ /
SF.Declarative "
Example : Chunking
• assemble the POS tagged words and identify chunk tag.

 Before chunking,
သူတုိ႔/NNR.Person # သည္/PPM.Subj # အတန္း/NN.Common #
ထဲတြင/္ PPM.Extract # အေတာ္ဆုံး/JJS.Common #
ေက်ာင္းသားမ်ား/NNR.Person# ျဖစ္/VB.Common # ၾက/Part.Support #
သည္/SF.Declarative

 After chunking,
NC [သူတုိ႔/NNR.Person] # PPC [သည္/PPM.Subj] # NC
[အတန္း/NN.Common] # PPC [ထဲတင
ြ /္ PPM.Extract] # NC
[အေတာ္ဆုံး/JJS.Common # ေက်ာင္းသားမ်ား/NNR.Person] # VC
[ျဖစ္/VB.Common # ၾက/Part.Support] # SFC [သည္/SF.Declarative]
Alignment
 Identifying word correspondence that are
translations of each other based on information
found on parallel text.

 Developing a Myanmar-English bilingual corpus:


◦ Dictionary lookup approach
◦ Corpus-based approach
Word Alignment Algorithm
Step 1: Accept pair of Myanmar and English sentences.
Step 2: Tag English sentence with Part-Of-speech (POS)
Tagger and it will produce tagged output also with
root word.
Step 3: Segment Myanmar sentence into words.
Removes the stop words.
Make morphological analysis of the noun and verb affixes
using trigram method.
Step 4: Align the output English and Myanmar words from
step 2 and 3 based on the first three IBM models and EM
algorithm using parallel corpus.
Step 5: Align the remaining words (i.e unaligned) using Myanmar-
English bilingual dictionary.
Example Alignment

သူ ေက်ာင္း သိ႕ု ေျခလ်င္ သြားသည္။

He goes to school on foot.


Problems in Alignment

 Scarce Resource
 No publicly available POS-tagged corpus for Myanmar and
English.
 The constructed POS-tagged corpus has a limited number in
size.

 Linguistic Problem
 Parallel sentence pairs might not be equal size.
 Myanmar and English word order could be significantly
different.
 Myanmar language is a morphologically rich and verb final
language. English is a verb-second language.

21
Translation

 Phrase/word Translation pairs Extraction


 Morphological Analysis
 Word Sense Disambiguation
Phrase/word Extraction
 For each phrase we identified by its start position, end
positions phrase length and target phrase to ensure that
there are no gaps and no overlap.
 Applying N-gram methods using Corpus,
Source Start End Phrase Target Translation
phrase position position Length phrase probability

ငွက္ 1 1 1 Bird 1.0

ငွက္မ်ား 1 2 2 Birds 1.0

ပ်ံ 4 4 1 Fly 1.0

ပ်ံၾကသည္ 4 6 3 Fly 1.0

 Translation
ငွက္မ်ား - birds
ပ်ံၾကသည္ - fly
Example : Morphological Analysis of verbs

• Myanmar unknown verb: ၾကည့္ခဲ့ပါသည္


• Main Verb: ၾကည့္
• Verb suffiex: ခဲ့ပါသည္
• Tense particle: ခဲ့
• Translation of main verb (using corpus): look
• Generation of surface word: ၾကည့္/look, ခဲ့/past
ပါသည္/null(suffix)

• ၾကည့္ခဲ့ပါသည္/looked
Word Sense Disambiguation for Myanmar
Language
 Purpose:
◦ to solve the ambiguity of Myanmar words for Myanmar-
English machine translation
Ambiguous Example
 Noun Examples
chopsticks

တူ nephew

hammer

 သူသည္တျူ ဖင့္ေခါက္ဆဲြစားသည္။ He eats the noodle with chopsticks.


 သူ႔မွာတူသံုးေယာက္ရွိသည္။ He has three nephews.
 လက္သမားသည္တူကိုသံုးသည္။ Carpenter uses the hammer.
WSD Algorithm for Myanmar Word
Step1:Preprocessing
-Segment input sentence
-Remove stop words from input sentence and create ambiguous vector
Step2:Multi-sense Look-up
-Retrieve all possible sense meanings of ambiguous word from corpus
-Collect training data concerning with these sense from corpus
Step3:Build context vectors for each sense based on collected
training data
-For all context vectors do
-Remove stop words
-Remove redundant words
-End For
Step4:Calculate the cosines between ambiguous vector and each
of the context vectors

where A represents each word in ambiguous vector


B represents each word in each context vector
Step5:Choose correct sense of the target word 27
s' = argmax score(si)
Conclusion
 The data sparseness is most important in many research
regarding NLP because of the followings:
◦ The rules only can not be solved for all problems for
many languages.
◦ So, the researches are coming based on the statistical
model.
◦ The more availability of data in developing the
system/tools, the more accuracy we can get.
 So, ALT data is very useful not only for Myanmar
language but also for all languages to be applied in various
kinds of NLP researches.
Thank you!

You might also like