A Comprehensive Dialect Conversion Approach From Chittagonian To Standard Bangla


Conference Paper · June 2020


DOI: 10.1109/TENSYMP50017.2020.9230714



2020 IEEE Region 10 Symposium (TENSYMP), 5-7 June 2020, Dhaka, Bangladesh

Hafizur Rahman Milon, Sheikh Nasir Uddin Sabbir, Azfar Inan, Nahid Hossain
Department Of Computer Science and Engineering
United International University, Dhaka, Bangladesh

978-1-7281-7366-5/20/$31.00 ©2020 IEEE

Abstract: We present a comprehensive system for converting the Chittagonian dialect to standard Bangla. It is a text-to-text conversion system based on word-to-word mapping with a bilingual dictionary, rule-based morphological transformation of suffixes, and a supporting word suggestion module. The system tokenizes the regional input text and processes each token through word-to-word mapping; if the mapping fails, it applies morphological transformation using suffix transformation rules. We also introduce an aiding tool that generates suggested words for the dialectal input. The system achieved an accuracy of 94.75% in producing standard Bangla translations from Chittagonian words. It must be noted that there is no published work on Chittagonian dialect conversion from a computational point of view. We are the first to have built such a system for Chittagonian dialect to standard Bangla conversion.

Keywords: Bangla, Dialect, Chittagonian, Double Metaphone.

I. Introduction
According to many linguists and researchers, dialects are just a different form of the language, spoken with different accents and morphemes. A dialect may even have its own grammar and sentence rules. Some dialects are rich enough to be accepted as a full-fledged language. Chittagonian is one of the principal dialects of the Bangla language and is spoken widely across the south-eastern region as the only means of communication. It is one of the most intricate dialects for non-native standard Bangla speakers to understand, as it is rich in its own words and phrases. Activities like establishing deals and finding accommodations prove to be challenging from time to time. To cope with this, people are using English and standard Bangla more; as a result, this enriched dialect is losing its speakers day by day.
As we have mentioned earlier, no notable work has been done yet that deals with the conversion of the Chittagonian dialect. In 2017, Amrita Das presented an in-depth study on Sylheti grammar, which helped us to work with Chittagonian grammar [1]. Muhammad Azizul Hoque's 2015 paper on the Chittagonian language describes Chittagonian grammar and word pronunciation, which helped us with our research [2]. In 2015, Arvinder Singh et al. proposed a converter for Punjabi dialects that used a rule-based approach and a bilingual dictionary [3]. In 2014, K Marimuthu et al. provided a method to convert dialectal Tamil text to standard Tamil text using Finite State Transducers, which yielded an accuracy rate of 85% [4]. In 2012, G.H. Al-Gaphari et al. worked with 9386 words and their rule-based approach yielded an accuracy of 77.32% [5]. Hitham Abo Bakr et al. proposed a hybrid approach for converting Egyptian colloquial to Modern Standard Arabic with an accuracy of 88% in 2008 [6]. They used tokenization and POS (Parts of Speech) tagging to improve the performance of their system. Md. Shahnur Azad Chowdhury worked on Bangla to English machine translation using POS tagging [7]. He used Tag Vectors and a set of grammar rules for the conversion process.
Our proposed system is the first that provides a comprehensive solution. We have created a bilingual dictionary as the dataset to map each Chittagonian word to a standard Bangla word. If the word-to-word mapping fails to give a proper translation, the system moves to suffix transformation. It splits each token into a root and a suffix and performs word-to-word mapping on the root word. We have used POS tagging to find the proper suffix that fits with the standard Bangla root word. We have also provided a word suggestion module, since people might spell the same word differently. We acquired the suggestions by means of Double Metaphone Encoding [8], LCS (Longest Common Subsequence) [9] [10], and K-NN (K-Nearest Neighbors) [11]. The Double Metaphone algorithm encodes the input into corresponding English letters, LCS compares Double Metaphone encodings to determine similarity, and K-NN finds the closest matches to generate the suggestions.
Section II describes the proposed system and presents a step-by-step explanation of our work along with algorithms. The experimental results and performance analysis are provided in Section III, and Section IV concludes the paper with limitations of the system and future work.

II. Proposed Method
In this section, we have incorporated the whole process step by step in detail.

A. Dataset Collection and Corpus Study
The Chittagonian dialect has a very different set of words than that of standard Bangla. The key part of the converter is the dataset. Accuracy and time complexity are immensely dependent on the dataset alone. The Chittagonian dialect hardly has any resources in written format. Although it is enriched in culture and literature, it lacks written texts, especially in a digital format. We had to build the dataset from scratch. We have collected most of the data from the book [12] by Noor Muhammad Rafiq. The book has 8500 Chittagonian words along with Bangla translation and about 100 complete sentence examples. Secondary sources were websites and social media posts and comments. Finally, our dataset contains 20,101 Chittagonian root words, 5,010 complete Chittagonian sentences for rule generation, and 2,230 complete Chittagonian sentences for testing the system's performance. After studying the corpus, we have noticed no significant differences between Chittagonian and standard Bangla in terms of grammatical rules. The key differences were in words, in suffixes, and in some negative sentences. Also, we have noticed that the spelling of a particular word may differ from person to person based on their accents and preferences.
Some Chittagonian words have several different standard Bangla meanings, which creates a one-to-many relationship. For example, the word 'আই' means both 'আমি' and 'এসে' in different contexts. At this stage of our work, we omit processing words based on their context. Some examples of word-to-word mapping between Chittagonian and Bangla words are given in Table I.

Table I
Word-to-Word mapping examples.
Chittagonian Word    Standard Bangla Word
অনেরা    আপনারা
অলপল    অগোছালো
ফইয়া    ভিক্ষুক
ইতারে    তাকে
আইন্দা    আগামী
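
The one-to-many relationship described above suggests a dictionary whose values are lists of senses. The following is a minimal sketch under that assumption; the structure `BILINGUAL` and the helpers `candidates` and `is_ambiguous` are hypothetical names for illustration, not the authors' implementation, and the entries are limited to examples quoted in the text.

```python
# Hypothetical miniature bilingual dictionary: each Chittagonian word maps to
# a list of standard Bangla senses. Entries come from the paper's examples.
BILINGUAL = {
    "অনেরা": ["আপনারা"],
    "ইতারে": ["তাকে"],
    "আই": ["আমি", "এসে"],  # one-to-many: the meaning depends on context
}


def candidates(word):
    """Return every known standard Bangla sense of a Chittagonian word."""
    return BILINGUAL.get(word, [])


def is_ambiguous(word):
    """True when a word has more than one sense and would need context."""
    return len(candidates(word)) > 1


if __name__ == "__main__":
    for word in BILINGUAL:
        print(word, candidates(word), is_ambiguous(word))
```

Keeping every sense in the dictionary makes the ambiguous entries easy to enumerate, even though choosing among them by context is left out at this stage.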
B. Translation Methodologies
We have used Tokenization, Rule-based Negation Handling, Word-to-Word Mapping, and Morphological Transformation using suffix rules. Fig. 1 shows the entire process of the Translation module.

Figure 1. System Diagram of Translation module.

Algorithm 1: Translation
inputSentence ← str()
tokens ← Tokenize(inputSentence)
Handle negative sentences, if any
for each ctgWord in tokens do
    bngWord ← Translate(ctgWord)
    if bngWord = "" then
        Split ctgWord into ctgStem and ctgSuffix
        bngStem ← Translate(ctgStem)
        if bngStem = "" then
            bngWord ← ctgWord
        else
            Get bngSuffix based on bngStem and ctgSuffix
            bngWord ← bngStem + bngSuffix
        end if
    end if
    translation ← translation + bngWord + " "
end for
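
Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the authors' code. The tiny `DICTIONARY` and `SUFFIX_RULES` tables are hypothetical stand-ins for the real dataset and rule set, filled only with examples that appear in this paper. The word-level alignment of 'হাই' and 'ডরাই' is assumed from the example sentence pair. The negation rule (move 'ন' after the following word and write it as 'না') is a deliberately crude assumption, and POS tagging is omitted.

```python
# Hypothetical miniature resources; the real system uses a 20,101-word dataset.
DICTIONARY = {
    "অনেরা": "আপনারা",
    "ইতারে": "তাকে",
    "হাই": "আমি",        # alignment assumed from the example sentence pair
    "ডরাই": "ভয় পাই",    # alignment assumed from the example sentence pair
    "দেশ": "দেশ",
    "দুন্নাই": "দুনিয়া",
}

# (Chittagonian suffix, Bangla stem ends in a vowel?) -> Bangla suffix.
SUFFIX_RULES = {
    ("ত", False): "ে",
    ("ত", True): "য়",
}
CTG_SUFFIXES = sorted({s for s, _ in SUFFIX_RULES}, key=len, reverse=True)


def ends_in_vowel(word):
    """True if the last code point is a Bengali vowel letter or vowel sign."""
    last = word[-1]
    return "\u0985" <= last <= "\u0994" or "\u09be" <= last <= "\u09cc"


def handle_negation(tokens):
    """Assumed rule: 'ন' before a word becomes 'না' after that word."""
    out, i = [], 0
    while i < len(tokens):
        if tokens[i] == "ন" and i + 1 < len(tokens):
            out.extend([tokens[i + 1], "না"])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def translate_word(word):
    """Direct lookup first, then stem + suffix rule, else keep the word."""
    direct = DICTIONARY.get(word, "")
    if direct:
        return direct
    for suffix in CTG_SUFFIXES:
        stem = word[: -len(suffix)]
        if stem and word.endswith(suffix):
            bng_stem = DICTIONARY.get(stem, "")
            if bng_stem:
                rule = SUFFIX_RULES.get((suffix, ends_in_vowel(bng_stem)))
                if rule is not None:
                    return bng_stem + rule
    return word


def translate(sentence):
    tokens = handle_negation(sentence.split())
    return " ".join(translate_word(token) for token in tokens)


if __name__ == "__main__":
    print(translate("হাই ন ডরাই"))
    print(translate("দেশত"))
```

Unknown words pass through unchanged, mirroring the `bngWord ← ctgWord` branch of Algorithm 1.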
1) Tokenization: In tokenization, we split the input sentence into separate words or tokens using Python's basic split method and prepare those tokens for further processing. For example, the sentence 'হাই ন ডরাই' is split into 'হাই', 'ন' and 'ডরাই'.

2) Rule-based Negation Handling: In most Chittagonian negative sentences, the negative word 'ন' precedes the verb (ক্রিয়া), whereas it follows the verb in standard Bangla. For example, in the sentence 'হাই ন ডরাই', the negative word 'ন' comes before the verb, while in the standard Bangla sentence 'আমি ভয় পাই না', the negative word 'না' sits after the verb. We handle these negations based on rules we have generated after analyzing Chittagonian sentences with negations.

3) Word-to-Word Mapping: This is the most important phase of the conversion process. Each input word is mapped into its corresponding standard Bangla word. It is a simple one-to-one mapping based on the input word.

4) Morphological Transformation Using Suffix Rules: If word-to-word mapping fails to generate a translation (i.e. returns an empty string), the system processes the input word using a rule-based approach. The system splits the input word into stem and suffix utilizing a collection of Chittagonian suffixes and translates the stem to a standard Bangla stem using word-to-word mapping. Then the Chittagonian suffix is mapped into the corresponding standard Bangla suffix using transformation rules. Finally, the system joins the output root word and the suffix to generate the standard Bangla word.
We have noticed that the suffix transformation rules vary depending on the last character of the standard Bangla root. Different suffixes are produced depending on whether the last character is a vowel (স্বরবর্ণ) or a consonant (ব্যঞ্জনবর্ণ). For example, the Bangla root words 'দেশ' and 'দুনিয়া' in Table II transform the same Chittagonian suffix 'ত' differently. The rules also differ for the same suffix across different POS. For example, 'দেশত' => 'দেশে' (বিশেষ্য) and 'টাইলত' => 'কাটাত' (ক্রিয়া) with the same suffix 'ত'. Some example rules are shown in Table II.

Table II
Suffix Rules examples.
Chittagonian Word    Bangla Stem    Bangla Last Character    Bangla Word
দেশত = দেশ + ত    দেশ    Consonant    দেশে = দেশ + ◌ে
দুন্নাইত = দুন্নাই + ত    দুনিয়া    Vowel    দুনিয়ায় = দুনিয়া + য়
দুক্কে = দুক্ক + ◌ে    দুঃখ    Consonant    দুঃখে = দুঃখ + ◌ে
কাদনর = কাদন + র    কান্না    Vowel    কান্নার = কান্না + র
ফইরারে = ফইরা + রে    ভিক্ষুক    Consonant    ভিক্ষুককে = ভিক্ষুক + কে
পাত্তরগান = পাত্তর + গান    পাথর    Consonant    পাথরটি = পাথর + টি

C. Word Suggestion Methodologies
There are no conventional spelling rules for the Chittagonian dialect. The spelling of a particular word may differ from person to person based on their accents and preferences. For example, the word 'অনেরা' from one user may be spelled differently as 'অনারা' or 'হনেরা' by another. All of these are potentially correct spellings meaning 'আপনারা' in standard Bangla. This could be catastrophic for the system, as the system might fail to generate a correct translation. We have introduced the word suggestion module to tackle this problem.
This module checks the input words and provides the user with a collection of suggested words for each input word. Key techniques used in the word suggestion module are: Double Metaphone Encoding, LCS and K-NN. Table III shows some sample outputs of the word suggestion module.

Algorithm 2: Word Suggestion
inputSentence ← str()
tokens ← Tokenize(inputSentence)
for each wordX in tokens do
    dmX ← DMetaphone(wordX)
    for each wordY in the Chittagonian dictionary do
        dmY ← DMetaphone(wordY)
        dmXLen ← len(dmX)
        dmYLen ← len(dmY)
        if dmXLen > dmYLen then
            lcsLen ← LCS(dmX, dmY)
        else
            lcsLen ← LCS(dmY, dmX)
        end if
        Add lcsLen to lcsLenList
    end for
    Find top 3 suggested words with highest lcsLen using K-NN
    return suggestions
end for

1) Double Metaphone Encoding: The given words in the input sentence are encoded using Double Metaphone Encoding. We have implemented Mumit Khan's Double Metaphone encoding table [14]. This process encodes the Bangla alphabet into corresponding English characters. Each Bangla character is coded with one or more English characters based on different contexts of the word. In our Double Metaphone encoding, we coded only consonant characters (ক, খ...). We didn't encode vowels (অ, আ...), since vowels do not make a significant difference in the pronunciation of a word [13] [14]. Some examples are provided in Table IV.

Table III
Word Suggestion examples.
Input Word    Suggested Words
অনেরা    অনেরা, অনারা, ঐন্না
উযু    উযু, উযুউযু, উইযুই
আননি    আননি, আনটেংসি, আনযা-আনযি
ইয়ত    ইয়ত, অছিয়ত, ঐয়ত

Table IV
Double Metaphone Encoding examples.
Chittagonian Word    Double Metaphone Encoding
অনেরা    onera
অলপল    olpl
ফইয়া    piya
ইতারে    itare
2) Longest Common Subsequence: LCS finds the longest sequence of characters that appears in the same order, though not necessarily contiguously, in two strings. The Double Metaphone encodings of the input words are used for string matching using LCS. It generates the LCS length of two words, that is, the length of the longest match between them. For each input word, a list of LCS lengths is generated for all the Chittagonian words. The list is then passed on to the K-NN process. Examples are provided in Table V.
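
The LCS length computation and the top-3 selection step of Algorithm 2 can be sketched as below. This is a minimal illustration under stated assumptions: `lcs_length` is the textbook dynamic-programming formulation, `suggest` simply ranks candidates by LCS length (a nearest-neighbour lookup with LCS length as similarity, not a full K-NN classifier), and the `ENCODINGS` table is a hypothetical stand-in for encoding the whole dictionary.

```python
import heapq


def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def suggest(input_code, encodings, k=3):
    """Return the k words whose encodings share the longest LCS with input_code."""
    return heapq.nlargest(
        k, encodings, key=lambda word: lcs_length(input_code, encodings[word])
    )


# Hypothetical precomputed Double Metaphone encodings for a few dictionary words.
ENCODINGS = {
    "অনেরা": "onera",
    "অনারা": "onara",
    "ঐন্না": "oinna",
    "আনৈননা": "anoinna",
    "ইতারা": "itara",
}

if __name__ == "__main__":
    for word, code in ENCODINGS.items():
        print(word, code, lcs_length("onera", code))
    print(suggest("onera", ENCODINGS))
```

Because the LCS length is symmetric in its two arguments, the length comparison in Algorithm 2 does not change the result and is omitted here.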
3) K-Nearest Neighbors: We have used K-NN to find the words that match the input word most closely and output them as suggested words. The LCS lengths work as the distance attribute here. The highest LCS length means the lowest distance and the lowest LCS length means the highest distance. All it does is find the words with the highest LCS lengths and suggest them, placing the closest word at the top of the suggestion list. We pass the LCS length list, as shown in Table V, onto K-NN; the system then processes the LCS lengths of each word and finds the closest ones. Here, K is set to 3, which means it finds the 3 closest suggestions.

Figure 2. Word Suggestion System Diagram.

Table V
LCS length examples.
Input Word    Chittagonian Word    D.M. Encoding    LCS Length
অনেরা    অনেরা    onera    5
অনেরা    অনারা    onara    4
অনেরা    ঐন্না    oinna    3
অনেরা    আনৈননা    anoinna    3
অনেরা    ইতারা    itara    2

III. Experimental Results And Performance Analysis
In this section, we have demonstrated the results and performance analysis of the translation and word suggestion modules.

A. Translation Evaluation
We have tested our system's performance with 2,230 Chittagonian sentences (11,042 words). We have divided these sentences into 4 smaller test sets and tested our system on each individually. If there is a single mistake anywhere (word mapping, suffix/prefix rule, and punctuation) in the output sentence, we have considered the whole sentence as an incorrect/erroneous conversion. The results are shown in Table VI.

Table VI
Experimental Results.
Dataset Name    Sentence Count    Error Count    Accuracy(%)
test1    557    26    95.332
test2    557    35    93.716
test3    558    29    94.803
test4    558    27    95.161

From the information available in Table VI, we have calculated an average conversion accuracy rate of 94.75% (the mean of the four per-set accuracies).
From close observation, we have noticed that most of the errors occurred due to one-to-many mapping of a word (i.e. repetition of the same word with different meanings). For example, the word 'হক্কল' has two different meanings: 'সকল' and 'সবাই'. Among other reasons were the lack of pure suffix transformation rules and the difference between 'Sadhu (সাধু)' and 'Chalit (চলিত)' accents.

B. Word Suggestion Evaluation
The Word Suggestion module evaluation process was a bit tricky, since a particular input word may have different correct suggestions based on different user input and requirements. There are no correct or incorrect results in this module. For example, the input word 'ওনেরা' (misspelled) gave us the output suggested words 'অনেরা', 'অনারা' and 'আনসাড়া'. The words 'অনেরা' and 'অনারা' were both pretty similar to the input, but the word 'আনসাড়া' seemed very dissimilar to the input. We think that the lack of dataset or words is the main reason behind this one peculiar output. A larger dataset would nudge the module towards generating more relevant suggested words.

IV. Conclusion And Future Work
We have presented a comprehensive dialect converter for the Chittagonian dialect and demonstrated different structures of the Chittagonian dialect. We have built a dataset of Chittagonian words, implemented word-to-word mapping, morphological rules for translation, and a module for word suggestion. Our method yields an encouraging result at this stage of our work. The main limitation of our work is the size of the dataset. We are working on increasing its size to attain better usability. Learnability of the system is another big issue, as we have not used any machine learning algorithms for the system to improve itself. We are working on the implementation of Neural Networks for the system to make it more robust and to increase the accuracy of the suggested words. The authors are currently working on the implementation of an STT (Speech to Text) and a TTS (Text to Speech) module for our system.

References
[1] Amrita Das, “A Comparative Study of Bangla and Sylheti Grammar,” PP. 389, Università degli Studi di Napoli Federico II, 2017.
[2] Muhammad Azizul Hoque, “Chittagonian Variety: Dialect, Language, or Semi-Language?,” CRP, International Islamic University Chittagong, Bangladesh, 2015.
[3] Arvinder Singh and Parminder Singh, “Punjabi dialects conversion system for Malwai and Doabi dialects,” Vol. 8, PP. 1–6, Indian Journal of Science and Technology, 2015.
[4] K Marimuthu and Sobha Lalitha Devi, “Automatic conversion of dialectal Tamil text to standard written Tamil text using FSTs,” PP. 37–45, Proceedings of the 2014 Joint Meeting of SIGMORPHON and SIGFSM, 2014.
[5] GH Al-Gaphari and M Al-Yadoumi, “A method to convert Sana’ani accent to Modern Standard Arabic,” Vol. 8, PP. 39–49, International Journal of Information Science and Management (IJISM), 2012.
[6] Hitham Abo Bakr, Khaled Shaalan and Ibrahim Ziedan, “A hybrid approach for converting written Egyptian colloquial dialect into diacritized Arabic,” The 6th International Conference on Informatics and Systems, Cairo University, 2008.
[7] Md Shahnur Azad Chowdhury, “Developing a Bangla to English Machine Translation System Using Parts Of Speech Tagging,” Vol. 1, PP. 113-119, Journal of Modern Science and Technology, 2013.
[8] Naushad UzZaman and Mumit Khan, “A double metaphone encoding for Bangla and its application in spelling checker,” PP. 705-710, International Conference on Natural Language Processing and Knowledge Engineering, 2005.
[9] Lasse Bergroth, Harri Hakonen and Timo Raita, “A survey of longest common subsequence algorithms,” PP. 39-48, Proceedings Seventh International Symposium on String Processing and Information Retrieval, SPIRE 2000.
[10] Deena Nath, Jitendra Kurmi and Vipin Rawat, “A Survey on Longest Common Subsequence,” Vol. 6, International Journal for Research in Applied Science & Engineering Technology (IJRASET), 2018.
[11] Sayali D. Jadhav and HP Channe, “Comparative study of K-NN, naive Bayes and decision tree classification techniques,” Vol. 5, PP. 1842–1845, International Journal of Science and Research (IJSR), 2016.
[12] Noor Muhammad Rafiq, “Chottogramer Ancholik Bhashar OVIDHAN,” ISBN: 9789849107521, 2nd edition, 2017.
[13] Min-Siong Liang, Ren-Yuan Lyu and Yuang-Chin Chiang, “Phonetic transcription using speech recognition technique considering variations in pronunciation,” Vol. 4, PP. IV–109, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP’07, 2007.
[14] Naushad UzZaman and Mumit Khan, “A Double Metaphone Encoding for Approximate Name Searching and Matching in Bangla,” Computational Intelligence, 2005.
