
Bangla Grammatical Error Detection Using T5 Transformer Model
H.A.Z. Sameen Shahgir and Khondker Salman Sayeed
Computer Science and Engineering, Bangladesh University of Engineering and Technology
[email protected], [email protected]

arXiv:2303.10612v1 [cs.CL] 19 Mar 2023

Abstract—This paper presents a method for detecting grammatical errors in Bangla using a Text-to-Text Transfer Transformer (T5) Language Model [8], using the small variant of BanglaT5 [1], fine-tuned on a corpus of 9385 sentences where errors were bracketed by the dedicated symbol $ [2]. The T5 model was primarily designed for translation and is not specifically designed for this task, so extensive post-processing was necessary to adapt it to the task of error detection. Our experiments show that the T5 model can achieve low Levenshtein Distance in detecting grammatical errors in Bangla, but post-processing is essential to achieve optimal performance. The final average Levenshtein Distance after post-processing the output of the fine-tuned model was 1.0394 on a test set of 5000 sentences. This paper also presents a detailed analysis of the errors detected by the model and discusses the challenges of adapting a translation model for grammar. Our approach can be extended to other languages, demonstrating the potential of T5 models for detecting grammatical errors in a wide range of languages.

Index Terms—Bangla, Grammatical Error Detection, Machine Learning, T5

I. INTRODUCTION

In an increasingly digital world, the ability to communicate effectively in written form has become a crucial skill. With the rise of digital communication platforms, such as email, instant messaging, and social media, written communication has become more pervasive than ever before. However, with this increased reliance on written communication comes a new set of challenges, including the need for accurate and effective grammar usage.

Grammar errors can impede effective communication and have serious consequences, especially in professional and academic settings where clarity and precision are paramount. Grammar errors can also impact the credibility of the writer and create confusion for the reader. In recent years, the development of deep learning models for grammar error detection (GED) and grammar error correction (GEC) has become an increasingly important area of research.

One product of this extensive research is Grammarly. It is one of the most ubiquitous grammar correction tools available today, with millions of users around the world. This tool uses the GECToR model [3] for error detection and correction. This model implements a tagging-based approach for error detection using an encoder and then uses a generative approach to correct that error based on the detection using a seq2seq model. This approach achieves state-of-the-art results on canonical GEC evaluation datasets based on F-score results. This makes it a valuable resource for individuals and organizations that rely on written communication. However, it is important to note that Grammarly and other similar tools are currently only available for a limited number of languages, primarily English.

Some research work has been done on GED and GEC in Bangla [4] [5], but to the best of our knowledge, no work leveraging transformer models has yet been done in Bangla. As mentioned before, GEC in English has already reached a commercially viable stage, and notable progress has been achieved using both seq2seq [6] [7] and BERT-based models [3]. Both deliver comparable performance [3], but the seq2seq models are easier to train, albeit with much slower inference. We ultimately decided on using the T5 model [8], pre-trained on a large Bangla corpus [1]. We tested both the base (220M parameters) and the small (60M parameters) variants of BanglaT5 and found the smaller model to perform slightly better within our computing budget.

T5, or Text-to-Text Transfer Transformer [8], is a Transformer-based architecture that uses a text-to-text approach. It adds a causal decoder to the bidirectional architecture of BERT [9]. The difference from the basic encoder-decoder transformer architecture [10] is that T5 uses relative positional embeddings and layer norm at the start of each block and at the end of the last block. Other than that, T5 and the basic encoder-decoder transformer share the same architecture. T5 was trained with the goal of unifying all NLP tasks with a single text-to-text model. In line with that goal, BanglaT5 [1] was trained on a massive Bengali pretraining corpus, Bangla2B+ [11], sized 27.5 GB. This allows BanglaT5 to achieve state-of-the-art results on most Bengali text generation tasks. Therefore, by leveraging transfer learning from an enormous Bengali text corpus, BanglaT5 is an ideal candidate to consider for Bengali GED and GEC tasks.

II. METHODOLOGY

A. Model Selection

Currently, BERT (Bidirectional Encoder Representations from Transformers) and its variants are the best-performing models on tasks such as token classification [12] and sentence classification [13].
Fig. 1. Encoder-based BERT architecture (left) vs. encoder-decoder-based text-to-text transformer architecture (right) (source: https://jalammar.github.io/illustrated-transformer/)

BanglaBERT reproduced this finding when trained specifically on a Bangla corpus [11]. Although GED can be formulated as either a token classification problem or a sentence classification problem, both formulations pose several challenges. When presented as a token classification task, punctuation becomes a particular issue, since most punctuation marks represent a pause and are hard to distinguish. Another challenge is tokens that are missing altogether. It can be hypothesized that BERT does have the ability to detect the logical inconsistency in a sentence that arises from missing tokens, thanks to its deep encoder architecture, but marking the position of missing tokens is a challenge. On the other hand, when posed as a sentence classification problem, we find that BERT can classify sentences as either error-free or containing errors well, but cannot mark the erroneous section itself. Recently, sequence-to-sequence (seq2seq) models such as T5 [8] have achieved state-of-the-art performance on standard Grammatical Error Correction (GEC) benchmarks [14]. Such models [6] [7] have been trained specifically on synthetic GEC datasets (as opposed to general translation datasets). But since the model must generate the entire output sequence, including the parts that were correct to begin with, inference is slow. The BERT-based GECToR [3] presents another way for GEC: a token classification approach where errors are mapped to 5000 error-correcting transformations (one for each token in the vocabulary, plus some token-independent transformations) which correct the errors algorithmically. The resulting model is up to 10 times faster than comparable seq2seq models, but as before, this requires a synthetic pretraining corpus.

For Bangla Grammatical Error Detection we decided on the small variant of BanglaT5 [1] with 60M parameters. The smaller model allowed for larger batch sizes, faster experimentation and hyper-parameter tuning when compared to the standard BanglaT5 model with 220M parameters, while delivering similar performance on our training set (9385 pairs). Experimentation on the larger T5 models using the full available dataset (19385 pairs) and evaluating a BERT-based approach similar to GECToR is left for future work.

B. Dataset Analysis

The training set consisted of 19385 sentence pairs in total, containing both error-free sentences and sentences with errors. The major error types are:
1) Single word error
2) Multi-word error
3) Wrong Punctuation
4) Punctuation Omission
5) Merge Error
6) Form/Inflection Error
7) Unwanted space error
8) Hybrid

The errors are each bracketed by a designated symbol $ and are not differentiated from each other.

TABLE I
DATASET CHARACTERISTICS

Split          Total   With Error   Num. Errors
DataSetFold1   9385    3693         7133
DataSetFold2   10000   4393         7352
Test           5000    -            -

We used DataSetFold1 for the fine-tuning of the T5 model and both DataSetFold1 and DataSetFold2 for the crucial post-processing steps.
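For illustration, a minimal helper (not part of the paper) that extracts the $-bracketed error spans from a marked sentence could look as follows; the tokens in the usage comment are placeholders, not real dataset examples:

def extract_error_spans(marked_sentence):
    # Error spans are wrapped in '$' symbols and carry no type label.
    spans, inside, current = [], False, []
    for ch in marked_sentence:
        if ch == "$":
            if inside:                      # closing marker: emit the span
                spans.append("".join(current))
                current = []
            inside = not inside             # toggle between outside/inside a span
        elif inside:
            current.append(ch)
    return spans

# Example with placeholder tokens:
# extract_error_spans("w1 $w2$ w3 $w4 w5$") returns ["w2", "w4 w5"]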
C. External Dataset

We collected a word list of 311 archaic Bangla verb words which were consistently marked as errors in the training dataset. We collected this word list with the aid of GitHub Copilot. This data was used in our regular-expression-based approach to GED.

D. Pre-processing

Not wanting to shift the distribution of the train set away from the test set, we kept pre-processing to a minimum. The sentences were normalized and tokenized using the normalizer and tokenizer used in pretraining [1]. One notable point is that we omitted newline characters when present inside sentences, since they interfere with the way the T5 model reads in sentences.
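A minimal sketch of this pre-processing step is given below. The csebuetnlp normalizer package and the checkpoint name "csebuetnlp/banglat5_small" are assumptions taken from the public BanglaT5 release and are not spelled out in the paper:

# Assumptions: the `normalizer` package and the checkpoint name are taken from
# the public BanglaT5 release and may differ from the authors' actual setup.
from normalizer import normalize
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglat5_small", use_fast=False)

def preprocess(sentence):
    sentence = sentence.replace("\n", " ")   # drop in-sentence newlines
    sentence = normalize(sentence)           # same normalizer as used in pretraining
    return tokenizer(sentence, truncation=True, max_length=256)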
E. Training

Through experimentation on an 80-20 split of DataSetFold1 between training and validation sets and using an effective batch size of 128, we determined 120 epochs to be a good stopping point before the model starts to over-fit. Then we used the entirety of DataSetFold1 for 120 epochs of training. Since the task was to predict 5000 test sentences while training only on 9385 training pairs, we determined that keeping any significant segment for validation and early stopping would be detrimental to overall performance.

Fig. 2. Training Loss vs Epoch

A naive attempt to train the model on the combined DataSetFold1 and DataSetFold2 dataset did not improve our score on the test set. This is likely because introducing new data requires re-tuning the model hyper-parameters. For now, we leave this as future work.
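The training setup above could be reproduced roughly as in the sketch below. The Hugging Face Trainer API, the checkpoint name, and the train_dataset variable (standing for the tokenized DataSetFold1 pairs) are assumptions; the paper does not name its training framework:

# Sketch only: framework, checkpoint name and `train_dataset` are assumptions.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

model = AutoModelForSeq2SeqLM.from_pretrained("csebuetnlp/banglat5_small")
tokenizer = AutoTokenizer.from_pretrained("csebuetnlp/banglat5_small", use_fast=False)

args = Seq2SeqTrainingArguments(
    output_dir="banglat5-ged",
    num_train_epochs=120,              # stopping point found on the 80-20 split
    learning_rate=5e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=8,     # effective batch size of 128
    lr_scheduler_type="linear",
    optim="adamw_torch",               # AdamW optimizer
    save_strategy="no",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,       # tokenized DataSetFold1 pairs (assumed)
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()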
F. Post-processing

The T5 model was built on the paradigm of unifying all NLP tasks under text-to-text classification, and on that front T5 achieves state-of-the-art results on many GLUE tasks. However, this paradigm does have its shortcomings. Of particular importance in the task of GED, when judged by Levenshtein Distance, is the tendency of T5 models to spell words differently or sometimes replace entire words with a close synonym, because reproducing the input sequence exactly is as important as marking the errors. This is a particular problem in Bangla GED, since the language is still evolving and multiple spellings of the same word are in use concurrently. Furthermore, there exist several Unicode characters representing the same Bangla letter or symbol, further complicating the reconciliation of the T5 output sequence with its input sequence.

To transform the raw T5 output to a form as close as possible to the input sequence, we present two algorithms. As an optional post-processing step applied to all outputs, we present a third algorithm that performs simple detection of error words that the model might have missed.

The first algorithm is for respelling and correcting the T5 output by comparing it character by character with the input sequence. Beginning with an empty string as the corrected output, if the next character is a $ symbol, it is appended to the corrected output string. If the next characters of the input and output sentences match, the character is appended. If they do not match, the next character of the output string is looked up in a table and, if present, the value from the lookup table represents the correction and is appended. This lookup table has been constructed manually by observing common T5 errors. Constructing the table automatically is left for future work.

If character-level corrections fail, the algorithm attempts to make word-level corrections by replacing entire words in the T5 output, and then character-level correction is attempted again.

The second algorithm is a regular-expression-based approach to GED, used in case the first algorithm fails to correct the T5 output. Certain common errors are learned from the training dataset and identified in the test set using sub-string replacements.

These two algorithms work in tandem to correct the T5 output. However, should a test sentence already be present in the training dataset, the error-marked sentence is directly pulled from the training dataset using another lookup table. In a real-world scenario, having a lookup table of the most commonly mistaken sentences or phrases can significantly speed up GED, since the need for a large deep learning model is bypassed entirely.

The pseudo-code for the algorithms is in the Appendix.
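As an aside on the Unicode issue noted above, two visually identical Bangla strings can differ at the codepoint level. The snippet below is an illustration only; the paper itself relies on a manually built character lookup table rather than Unicode normalization:

import unicodedata

# Illustration only: U+09DC (BENGALI LETTER RRA) and the sequence
# U+09A1 (DDA) + U+09BC (NUKTA) render identically but compare unequal.
a = "\u09a1\u09bc"   # decomposed form
b = "\u09dc"         # precomposed form

print(a == b)                                                               # False
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))   # True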
III. RESULTS AND DISCUSSION

Training the BanglaT5 small model with 60M parameters on 9385 sentence pairs for 120 epochs, with a batch size of 128, a learning rate of 5 × 10^-4, the AdamW optimizer and a linear learning rate scheduler, yielded a final Levenshtein score of 1.0394 on 5000 test sentences. The effect of the multiple post-processing steps is presented below, serving as a short ablation study of our methodology. Average Levenshtein distance data on the test dataset was collected from submissions to the EEE DAY 2023 Datathon. The private and public scores are based on a 50-50 split of the 5000 test sentences. We calculated the total Aggregated distance by averaging the two.

TABLE II
AVERAGE LEVENSHTEIN SCORE
CC = Character Correction, WC = Word Correction, R = Regex Model, L = Lookup Table, P2 = Post Processing 2.
(Regex = number of sentences handled by the regex model; Match = number of sentences found in the training-set lookup table.)

Model          Regex   Match   Private   Public   Aggregated
Raw            -       -       3.216     3.208    3.212
No Corr.       -       -       1.5072    1.5168   1.512
R              5000    -       1.1896    1.1916   1.1906
CC             -       -       1.1072    1.134    1.1206
CC+R           107     -       1.0732    1.1048   1.089
CC+WC+R        42      -       1.072     1.1012   1.0866
CC+WC+R+L      40      253     1.0224    1.0588   1.0416
CC+WC+R+L+P2   40      253     1.0224    1.0564   1.0394
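The scores above average the per-sentence Levenshtein distance between the post-processed output and the reference. The paper does not specify its exact implementation; a minimal sketch of such a metric is:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def average_levenshtein(predictions, references):
    # Mean per-sentence distance between post-processed outputs and references.
    return sum(levenshtein(p, r) for p, r in zip(predictions, references)) / len(references)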
After character-level corrections, the T5 output still had a severe mismatch with the original input in 107 sentences. These arise mainly from two causes: entire words being replaced, or sentences that exceed the maximum input token limit (256) of the model. Using only the regex-based algorithm yields a modest score of 1.1906, but using it to handle the 107 sentences that could not be corrected resulted in a significant improvement (1.0866). Finally, the lookup table also modestly improves the Levenshtein score (1.0394) by looking up 253 sentences with exact matches in the training dataset.
IV. CONCLUSION AND FUTURE WORK

In conclusion, we trained a T5 model for Grammatical Error Detection and evaluated its performance using Levenshtein distance. Although it is difficult to compare our results to previous work, which typically uses the F1 score, our model achieved good performance on the dataset we used. However, we acknowledge that we only used 50% of the dataset, and using the entire dataset may have improved our results. Additionally, using T5 base instead of T5 small may have improved our performance with hyperparameter tuning.

We also noted that preprocessing could have rooted out spelling errors, leaving more difficult semantic errors for the T5 model to handle. Moreover, we identified that the post-processing step could be automated to improve the performance further.

Looking forward, we suggest exploring a BERT-based approach like GECToR [3] for Grammatical Error Detection. Overall, our work demonstrates the potential of T5 models for Grammatical Error Detection and provides a foundation for future work in this field.

REFERENCES

[1] Abhik Bhattacharjee, Tahmid Hasan, Wasi Uddin Ahmad, and Rifat Shahriyar. BanglaNLG: Benchmarks and resources for evaluating low-resource natural language generation in Bangla. arXiv preprint arXiv:2205.11081, 2022.
[2] Tasnim Nishat Islam, Md Boktiar Mahbub Murad, and Sushmit. Apurba presents Bhashabhrom: EEE Day 2023 Datathon, 2023.
[3] Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, and Oleksandr Skurzhanskyi. GECToR – grammatical error correction: tag, not rewrite. arXiv preprint arXiv:2005.12592, 2020.
[4] Mir Noshin Jahan, Anik Sarker, Shubra Tanchangya, and Mohammad Abu Yousuf. Bangla real-word error detection and correction using bidirectional LSTM and bigram hybrid model. In Proceedings of International Conference on Trends in Computational and Cognitive Engineering: Proceedings of TCCE 2020, pages 3–13. Springer, 2020.
[5] Nahid Hossain, Salekul Islam, and Mohammad Nurul Huda. Development of Bangla spell and grammar checkers: Resource creation and evaluation. IEEE Access, 9:141079–141097, 2021.
[6] Roman Grundkiewicz, Marcin Junczys-Dowmunt, and Kenneth Heafield. Neural grammatical error correction systems with unsupervised pre-training on synthetic data. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 252–263, 2019.
[7] Shun Kiyono, Jun Suzuki, Masato Mita, Tomoya Mizumoto, and Kentaro Inui. An empirical study of incorporating pseudo data into grammatical error correction. arXiv preprint arXiv:1909.00502, 2019.
[8] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
[11] Abhik Bhattacharjee, Tahmid Hasan, Kazi Samin, M. Sohel Rahman, Anindya Iqbal, and Rifat Shahriyar. BanglaBERT: Combating embedding barrier for low-resource language understanding. arXiv preprint arXiv:2101.00204, 2021.
[12] Shervin Malmasi, Anjie Fang, Besnik Fetahu, Sudipta Kar, and Oleg Rokhlenko. SemEval-2022 Task 11: Multilingual complex named entity recognition (MultiCoNER). In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pages 1412–1437, 2022.
[13] Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, PYKL Srinivas, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. SemEval-2020 Task 9: Overview of sentiment analysis of code-mixed tweets. SemEval@COLING, pages 774–790, 2020.
[14] Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe. The BEA-2019 shared task on grammatical error correction. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 52–75, 2019.
V. APPENDIX

A. Pseudocode for Correcting Mismatches in T5 Output
def t5_output_correction(t5_output, t5_input):
    # Reconcile the raw T5 output with the original input character by
    # character, keeping the $ error markers produced by the model.
    # `character_lookup`, `word_correction` and `regex_correction` are
    # defined elsewhere (word_correction is not shown in the paper).
    corrected_output = ""
    i1 = 0  # position in the original input sentence
    i2 = 0  # position in the (possibly respelled) T5 output
    attempt_word_corr = True
    while True:
        # Both strings fully consumed: reconciliation succeeded.
        if i1 == len(t5_input) and i2 == len(t5_output):
            return corrected_output

        # Characters agree: keep the input character.
        if (i1 < len(t5_input) and i2 < len(t5_output)
                and t5_input[i1] == t5_output[i2]):
            corrected_output += t5_input[i1]
            i1 += 1
            i2 += 1
            continue

        # Error markers exist only in the T5 output; copy them through.
        if i2 < len(t5_output) and t5_output[i2] == "$":
            corrected_output += "$"
            i2 += 1
            continue

        # Known T5 respelling: the manually built table maps the output
        # character back to the expected input character.
        if (i1 < len(t5_input) and i2 < len(t5_output)
                and character_lookup.get(t5_output[i2]) == t5_input[i1]):
            corrected_output += t5_input[i1]
            i1 += 1
            i2 += 1
            continue

        # Character-level correction failed: replace whole words in the T5
        # output once, then rescan from the start.
        if attempt_word_corr:
            t5_output = word_correction(t5_output, t5_input)
            corrected_output = ""
            i1 = 0
            i2 = 0
            attempt_word_corr = False
            continue

        # Fall back to the regex-based detector on the original input.
        return regex_correction(t5_input)

B. Pseudocode for Regex-based Error Detection


import re

def regex_correction(sentence):
    # Bracket words from the curated error-word list (e.g. archaic verb forms).
    for word in common_errors_set:
        sentence = sentence.replace(word, "$" + word + "$")

    # Bracket spans matched by the regular-expression rules learned from the
    # training data.
    for rule in regex_rules:
        sentence = re.sub(rule, lambda m: "$" + m.group(0) + "$", sentence)

    return sentence
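As a usage illustration, the pieces above could be combined as follows. The names sentence_lookup (mapping training sentences to their gold error markings) and t5_generate (the model's generation call) are hypothetical and do not appear in the paper:

# Hypothetical glue code: `sentence_lookup`, `character_lookup`,
# `common_errors_set`, `regex_rules` and `t5_generate` are assumed to be
# built or provided elsewhere.
def detect_errors(raw_sentence, t5_generate):
    # Shortcut: sentences seen verbatim in the training data reuse their
    # gold error markings, bypassing the model entirely.
    if raw_sentence in sentence_lookup:
        return sentence_lookup[raw_sentence]

    t5_output = t5_generate(raw_sentence)            # raw model prediction
    return t5_output_correction(t5_output, raw_sentence)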
