Bangla Grammatical Error Detection Using T5 Transformer Model
H.A.Z. Sameen Shahgir
Computer Science and Engineering
Bangladesh University of Engineering and Technology
[email protected]

Khondker Salman Sayeed
Computer Science and Engineering
Bangladesh University of Engineering and Technology
[email protected]
Abstract—This paper presents a method for detecting grammatical errors in Bangla using a Text-to-Text Transfer Transformer (T5) language model [8], specifically the small variant of BanglaT5 [1], fine-tuned on a corpus of 9385 sentences in which errors were bracketed by the dedicated symbol $ [2]. The T5 model was primarily designed for translation and not for this task, so extensive post-processing was necessary to adapt it to error detection. Our experiments show that the T5 model can achieve a low Levenshtein Distance in detecting grammatical errors in Bangla, but post-processing is essential for optimal performance. The final average Levenshtein Distance after post-processing the output of the fine-tuned model was 1.0394 on a test set of 5000 sentences. This paper also presents a detailed analysis of the errors detected by the model and discusses the challenges of adapting a translation model for grammar. Our approach can be extended to other languages, demonstrating the potential of T5 models for detecting grammatical errors in a wide range of languages.

Index Terms—Bangla, Grammatical Error Detection, Machine Learning, T5

I. INTRODUCTION

In an increasingly digital world, the ability to communicate effectively in written form has become a crucial skill. With the rise of digital communication platforms such as email, instant messaging, and social media, written communication has become more pervasive than ever before. However, with this increased reliance on written communication comes a new set of challenges, including the need for accurate and effective grammar usage.

Grammar errors can impede effective communication and have serious consequences, especially in professional and academic settings where clarity and precision are paramount. They can also undermine the credibility of the writer and create confusion for the reader. In recent years, the development of deep learning models for grammar error detection (GED) and grammar error correction (GEC) has become an increasingly important area of research.

One product of this extensive research is Grammarly, one of the most ubiquitous grammar correction tools available today, with millions of users around the world. It uses the GECToR model [3], which implements a tagging-based approach for error detection using an encoder and then corrects the detected error with a generative seq2seq model. This approach achieves state-of-the-art F-scores on canonical GEC evaluation datasets, making it a valuable resource for individuals and organizations that rely on written communication. However, it is important to note that Grammarly and other similar tools are currently available for only a limited number of languages, primarily English.

Some research has been done on GED and GEC in Bangla [4], [5], but to the best of our knowledge, no work leveraging transformer models has yet been done for Bangla. As mentioned before, GEC in English has already reached a commercially viable stage, and notable progress has been achieved using both seq2seq [6], [7] and BERT-based models [3]. Both deliver comparable performance [3], but the seq2seq models are easier to train, albeit with much slower inference. We ultimately decided on the T5 model [8], pre-trained on a large Bangla corpus [1]. We tested both the base (220M parameters) and the small (60M parameters) variants of BanglaT5 and found the smaller model to perform slightly better within our computing budget.

T5, or Text-to-Text Transfer Transformer [8], is a Transformer-based architecture that uses a text-to-text approach. It adds a causal decoder to the bidirectional architecture of BERT [9]. It differs from the basic encoder-decoder Transformer architecture [10] in that T5 uses relative positional embeddings and applies layer normalization at the start of each block and at the end of the final block; otherwise, the two architectures are the same. T5 was trained with the goal of unifying all NLP tasks under a single text-to-text model. In line with that goal, BanglaT5 [1] was pretrained on Bangla2B+ [11], a massive 27.5GB Bengali corpus, which allows it to achieve state-of-the-art results on most Bengali text generation tasks. Leveraging transfer learning from such an enormous Bengali text corpus, BanglaT5 is therefore an ideal candidate for Bengali GED and GEC tasks.

II. METHODOLOGY

A. Model Selection

Currently, BERT (Bidirectional Encoder Representations from Transformers) and its variants are the best-performing models on tasks such as token classification [12] and sentence classification [13]. BanglaBERT reproduced this finding when
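To make the text-to-text task format concrete, the sketch below shows how a fine-tuned checkpoint can be queried at inference time: the model receives a raw Bangla sentence and is expected to echo it back with erroneous spans bracketed by $. The Hugging Face identifier and generation parameters here are illustrative assumptions, not the exact experimental setup.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative checkpoint id for the small BanglaT5 variant;
# in practice, substitute the path of the locally fine-tuned model.
MODEL_NAME = "csebuetnlp/banglat5_small"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def detect_errors(sentence: str) -> str:
    # Encode the input sentence and generate the "$"-annotated version.
    inputs = tokenizer(sentence, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)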
Fig. 1. Encoder-based BERT architecture (left) vs. encoder-decoder based text-to-text Transformer architecture (right). (Source: https://jalammar.github.io/illustrated-transformer/)
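The central post-processing step is a character-level alignment between the model input and the model output. Matching characters are copied through unchanged, $ error markers emitted by the model are preserved, known character confusions are resolved against a lookup table, and if the alignment breaks down, a single word-level correction pass is attempted before falling back to regex-based correction of the raw input: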
# character_lookup (a table of commonly confused characters),
# word_correction and regex_correction are defined elsewhere in
# the post-processing pipeline.
def character_correction(t5_input, t5_output, attempt_word_corr=True):
    corrected_output = ""
    i1, i2 = 0, 0  # cursors into the input sentence and the model output
    while i1 < len(t5_input) and i2 < len(t5_output):
        # Characters agree: copy the input character through.
        if t5_input[i1] == t5_output[i2]:
            corrected_output += t5_input[i1]
            i1 += 1
            i2 += 1
            continue
        # The model emitted an error marker: keep it.
        if t5_output[i2] == "$":
            corrected_output += "$"
            i2 += 1
            continue
        # Known character confusion: trust the original input character.
        if t5_output[i2] in character_lookup:
            if t5_input[i1] == character_lookup[t5_output[i2]]:
                corrected_output += t5_input[i1]
                i1 += 1
                i2 += 1
                continue
        # Alignment failed: restart once from scratch after a
        # word-level correction pass over the model output.
        if attempt_word_corr:
            t5_output = word_correction(t5_output, t5_input)
            corrected_output = ""
            i1, i2 = 0, 0
            attempt_word_corr = False
            continue
        # Still misaligned: fall back to regex-based correction.
        return regex_correction(t5_input)
    return corrected_output
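Post-processed outputs are scored against gold annotations by average Levenshtein Distance. Below is a minimal sketch of the metric, assuming predictions and references are plain strings in the same $-bracketed format:

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def average_levenshtein(predictions, references):
    # Mean edit distance over the test set (5000 sentences in our case).
    return sum(levenshtein(p, r)
               for p, r in zip(predictions, references)) / len(references)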