Figure: BSpell architecture. Character embeddings (length len_C per word) of each input word W_i pass through the SemanticNet CNN sub-model (sequential 1D convolutions of kernel size 3 followed by 1D global max pooling) to produce a SemanticVec F_i per word; the SemanticVecs are concatenated and passed through a dense layer, BERT_Base and softmax layers.
…effective pattern based word level spell checker, we introduce an auxiliary loss based secondary branch in BSpell. Each of the n SemanticVecs obtained from the n input words is passed in parallel to a Softmax layer without any further modification. The outputs obtained from this branch are probability vectors similar to the main branch output. The total loss of BSpell can be expressed as: L_Total = L_Final + λ × L_Auxiliary. We want our final loss to have a greater impact on model weight update, as it is associated with the final prediction made by BSpell. Hence, we impose the constraint 0 < λ < 1. This secondary branch of BSpell does not have any Transformer encoders through which the input words can interact to produce context information. The prediction made from this branch depends solely on the misspelled word pattern extracted by SemanticNet. This enables SemanticNet to learn a more meaningful word representation.
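As a minimal sketch (not code from the paper), the weighted two-branch loss above can be written as a small helper; the default λ = 0.3 follows the value reported in Section 4.3, and the function name is ours:

```python
def bspell_total_loss(loss_final, loss_auxiliary, lam=0.3):
    """L_Total = L_Final + lambda * L_Auxiliary, with 0 < lambda < 1
    so the final (Transformer branch) loss dominates the weight update."""
    assert 0.0 < lam < 1.0
    return loss_final + lam * loss_auxiliary
```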
3.3 BERT Hybrid Pretraining

In contemporary BERT pretraining methods, each input word Word_i may be kept intact or may be replaced by a default mask word in a probabilistic manner (Devlin et al., 2018; Liu et al., 2019). BERT has to predict the masked words. Mistakes from the BERT side contribute to the loss value, driving backpropagation based weight updates. In this process, BERT learns to fill in the gaps, which in turn teaches the model language context.

Figure 5: BERT hybrid pretraining. Input words Word1–Word5, some word masked and some character masked, are fed to the Semantic BERT (BSpell) model, which must recover the original words Word1–Word5.
Sun et al. (2020) proposed incremental ways of pretraining the model for new NLP tasks. We take a more task specific approach for masking. In SC, recognizing noisy word patterns is important, but there is no provision for that in contemporary pretraining schemes; so, we propose hybrid masking (see Figure 5). Among the n input words in a sentence, we randomly replace n_W words with a mask word Mask_W. Among the remaining n − n_W words, we choose n_C words for character masking. During character masking, we choose m_C characters at random from a word having m characters and replace them with a mask character Mask_C. Such masked characters introduce noise in words and help BERT to understand the probable semantic meaning of noisy/misspelled words.
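The masking step can be sketched as follows; this is our own minimal illustration of the procedure described above, where the mask tokens, parameter names and the use of fractions instead of the raw counts n_W, n_C and m_C are assumptions, not the paper's implementation:

```python
import random

MASK_WORD = "<mask_w>"   # assumed word-level mask token
MASK_CHAR = "#"          # assumed character-level mask symbol

def hybrid_mask(words, p_word=0.15, p_char_word=0.40, p_char=0.25, rng=random):
    """Replace a fraction p_word of the words entirely with MASK_WORD; among the
    remaining words, pick a fraction p_char_word and mask a fraction p_char of
    each picked word's characters with MASK_CHAR."""
    idx = list(range(len(words)))
    rng.shuffle(idx)
    n_w = round(p_word * len(words))
    word_masked = set(idx[:n_w])                      # words masked as a whole
    remaining = idx[n_w:]
    char_masked = set(remaining[:round(p_char_word * len(remaining))])

    out = []
    for i, w in enumerate(words):
        if i in word_masked:
            out.append(MASK_WORD)
        elif i in char_masked and w:
            chars = list(w)
            m_c = max(1, round(p_char * len(chars)))  # characters to mask in this word
            for j in rng.sample(range(len(chars)), m_c):
                chars[j] = MASK_CHAR
            out.append("".join(chars))
        else:
            out.append(w)
    return out
```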
4 Experimental Setup

4.1 Implemented Pretraining Schemes

We have experimented with three types of masking based pretraining schemes. During word masking, we randomly select 15% of the words of a sentence and replace them with a fixed mask word. During character masking, we randomly select 50% of the words of a sentence; for each selected word, we randomly mask 30% of its characters by replacing each of them with a special mask character. Finally, during hybrid masking, we randomly select 15% of the words of a sentence and replace them with a fixed mask word, then randomly select 40% of the remaining words and mask 25% of their characters.
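Expressed with the hybrid_mask sketch from Section 3.3 above, the three schemes correspond to the following hypothetical parameter settings (the dictionaries and example sentence are ours):

```python
# Hypothetical parameterization of the three pretraining schemes of Section 4.1.
word_masking   = dict(p_word=0.15, p_char_word=0.00, p_char=0.00)  # mask 15% of words entirely
char_masking   = dict(p_word=0.00, p_char_word=0.50, p_char=0.30)  # 50% of words, 30% of their chars
hybrid_masking = dict(p_word=0.15, p_char_word=0.40, p_char=0.25)  # scheme used for BSpell

sentence = "this is a sample sentence for masking"
masked_sentence = hybrid_mask(sentence.split(), **hybrid_masking)
```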
4.2 Dataset Specification

We have used one Bangla and one Hindi corpus with over 5 million (5 M) sentences for BERT pretraining (see Table 1). The Bangla pretraining corpus consists of Prothom Alo[2] articles dated from 2014 to 2017 and BDnews24[3] articles dated from 2015 to 2017. The Hindi pretraining corpus consists of the Hindi Oscar Corpus[4], preprocessed Wikipedia articles[5], the HindiEnCorp05 dataset[6] and WMT Hindi News Crawl data[7] (all of these are publicly available corpora). We have used the Prothom-Alo 2017 online newspaper dataset for Bangla SC training and validation purposes. The errors in this corpus have been produced synthetically using the probabilistic algorithm described by Sifat et al. (2020). We further validate our baselines and proposed methods on the open source Hindi SC dataset ToolsForIL (Etoori et al., 2018). For the real error dataset, we have collected a total of 6300 sentences from the Nayadiganta[8] online newspaper and distributed them among ten participants, who have typed (at regular speed) each correct sentence using an English QWERTY keyboard, producing natural spelling errors. It has taken 40 days to finish the labeling. Top words have been taken such that they cover at least 95% of the corresponding corpus.

[2] https://www.prothomalo.com/
[3] https://bangla.bdnews24.com/
[4] https://www.kaggle.com/abhishek/hindi-oscar-corpus
[5] https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k
[6] http://hdl.handle.net/11858/00-097C-0000-0023-625F-0
[7] https://www.aclweb.org/anthology/W19-5301
[8] https://www.dailynayadiganta.com/

4.3 BSpell Architecture Hyperparameters

The SemanticNet sub-model of BSpell consists of a character level embedding layer producing a 40-dimensional vector for each character, followed by 5 consecutive 1D convolution layers (with batch normalization and ReLU activation between each pair of convolution layers) and finally a 1D global max pooling layer that produces the SemanticVec representation of each input word. The five 1D convolution layers use (64, 2), (64, 3), (128, 3), (128, 3) and (256, 4) convolutions, respectively, where the first and second elements of each tuple denote the number of convolution filters and the kernel size. We provide a weight of 0.3 (the λ value of the loss function) to the auxiliary loss. The main branch of BSpell is similar to BERT_Base (Gong et al., 2019) in terms of stacking 12 Transformer encoders. Attention outputs from each Transformer are passed through a dropout layer (Srivastava et al., 2014) with a dropout rate of 0.3 and then layer normalized (Ba et al., 2016). We use the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.001 for model weight updates. We clip gradient values to below 5.0 to avoid the exploding gradient problem.
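A minimal Keras sketch of the SemanticNet sub-model and optimizer settings described above follows; the character vocabulary size, maximum word length, padding mode and exact placement of normalization are our assumptions rather than details taken from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

CHAR_VOCAB = 80     # assumed: character vocabulary size (73 Bangla / 77 Hindi chars plus specials)
MAX_WORD_LEN = 20   # assumed: maximum number of characters per word

def build_semantic_net():
    char_ids = layers.Input(shape=(MAX_WORD_LEN,), dtype="int32")
    x = layers.Embedding(CHAR_VOCAB, 40)(char_ids)                 # 40-dim character embeddings
    conv_specs = [(64, 2), (64, 3), (128, 3), (128, 3), (256, 4)]  # (filters, kernel size)
    for i, (filters, kernel) in enumerate(conv_specs):
        x = layers.Conv1D(filters, kernel, padding="same")(x)
        if i < len(conv_specs) - 1:                                # BatchNorm + ReLU between conv pairs
            x = layers.BatchNormalization()(x)
            x = layers.ReLU()(x)
    semantic_vec = layers.GlobalMaxPooling1D()(x)                  # SemanticVec for one word
    return tf.keras.Model(char_ids, semantic_vec, name="SemanticNet")

semantic_net = build_semantic_net()

# SGD with learning rate 0.001 and gradient values clipped at 5.0, as stated in Section 4.3.
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, clipvalue=5.0)
```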
| Datasets | Unique Word | Unique Char | Top Word | Train Sample | Validation Sample | Unique Error Word | Error Word Percentage |
|---|---|---|---|---|---|---|---|
| Prothom-Alo Bangla Synthetic Error | 262 K | 73 | 35 K | 1 M | 200 K | 450 K | 52% |
| Bangla Real Error | 14.5 K | 73 | _ | 4.3 K | 2 K | 10 K | 36% |
| Bangla Pretrain Corpus | 513 K | 73 | 40 K | 5.5 M | _ | _ | _ |
| Hindi Synthetic Error Corpus (ToolsForIL) | 20.5 K | 77 | 15 K | 75 K | 16 K | 5 K | 10% |
| Hindi Pretrain Corpus | 370 K | 77 | 40 K | 5.5 M | _ | _ | _ |

Table 1: Dataset specification.
Table 2: Comparing BERT based variants. Typical word masking based pretraining has been used on all these variants. Real-Error (Fine Tuned) denotes fine tuning of the Bangla synthetic error dataset trained model on the real error dataset, while Real-Error (No Fine Tune) means directly validating the synthetic error dataset trained model on the real error dataset without any further fine tuning.
…language through a fill in the gaps sort of approach. SC is not all about filling in the gaps. It is also about what the writer wants to say, i.e. being able to predict a word even if some of its characters are blank (masked). Character masking takes a more drastic approach by completely eliminating the fill in the gap task. This approach masks a few of the characters residing in some of the input words of the sentence and asks BSpell to predict these noisy words' original correct version. The lack of context in such a pretraining scheme has a negative effect on performance in the real error dataset experiments, where harsh errors exist and context is the only feasible way of correcting such errors (see Table 3). Hybrid masking focuses both on filling in word gaps and on filling in character gaps through prediction of the correct word, and helps BSpell achieve SOTA performance.

5.4 BSpell vs Possible LSTM Variants

BiLSTM is a many to many bidirectional LSTM (two layers) that takes in all n words of a sentence at once and predicts their correct version as output (Schuster and Paliwal, 1997). During SC, BiLSTM takes both previous and following context into consideration besides the writing pattern of each word and shows reasonable performance (see Table 4). In Stacked BiLSTM, we stack twelve many to many bidirectional LSTMs instead of just two. We see marginal improvement in SC performance in spite of such a large increase in parameter count. The Attn_Seq2seq LSTM model utilizes an attention mechanism at the decoder side (Bahdanau et al., 2014). This model takes in misspelled sentence characters as input and provides the correct sequence of characters as output (Etoori et al., 2018). Due to word level spelling correction evaluation, this model faces the same problems as the BERT Seq2seq model discussed in Subsection 5.2. The proposed BSpell outperforms these models by a large margin.
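For concreteness, the two-layer many to many BiLSTM baseline can be sketched roughly as below; the sequence length, FastText embedding dimension, output vocabulary and LSTM width are placeholders of ours, not values reported in the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN, EMB_DIM, VOCAB, UNITS = 30, 300, 35000, 256     # assumed placeholder sizes

inputs = layers.Input(shape=(SEQ_LEN, EMB_DIM))          # pre-computed FastText vector per word
x = layers.Bidirectional(layers.LSTM(UNITS, return_sequences=True))(inputs)
x = layers.Bidirectional(layers.LSTM(UNITS, return_sequences=True))(x)
outputs = layers.TimeDistributed(layers.Dense(VOCAB, activation="softmax"))(x)
bilstm_sc = tf.keras.Model(inputs, outputs)              # one corrected-word prediction per position
```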
5.5 Ablation Study

BSpell has three unique features: (1) a secondary branch with auxiliary loss (it is possible to remove this branch), (2) the 1D CNN based SemanticNet sub-model (which can be replaced by simple Byte Pair Encoding (BPE) (Vaswani et al., 2017)) and (3) hybrid pretraining (which can be replaced by word masking based pretraining). Table 5 demonstrates the results we obtain after removing any one of these features. In all cases, the results show a downward trend compared to the original architecture.

5.6 Existing Bangla Spell Checkers vs BSpell

The phonetic rule based SC takes a hard coded approach based on Bangla phonetic rules (Saha et al., 2019), where a hybrid of the Soundex (UzZaman and Khan, 2004) and Metaphone (UzZaman and Khan, 2005) algorithms has been used. The clustering based SC, on the other hand, follows some predefined rules
| Spell Checker Architecture | Synthetic Error (Prothom-Alo) ACC / F1 | Real-Error (No Fine Tune) ACC / F1 | Real-Error (Fine Tuned) ACC / F1 | Synthetic Error (Hindi) ACC / F1 |
|---|---|---|---|---|
| BiLSTM | 81.9% / 0.818 | 78.3% / 0.781 | 81.1% / 0.809 | 81.2% / 0.809 |
| Stacked BiLSTM | 83.5% / 0.832 | 80.1% / 0.80 | 82.4% / 0.822 | 82.7% / 0.824 |
| Attn_Seq2seq (Char) | 20.5% / 0.178 | 15.4% / 0.129 | 17.3% / 0.152 | 22.7% / 0.216 |
| BSpell | 97.6% / 0.971 | 87.8% / 0.873 | 91.5% / 0.911 | 97.2% / 0.97 |

Table 4: Comparing LSTM based variants with hybrid pretrained BSpell. FastText word representation has been used with the LSTM portion of each architecture.
Table 5: Comparing BSpell with its variants created by removing one of its novel features
on word cluster formation, distance measurement and correct word suggestion (Mandal and Hossain, 2017). Since these two SCs are not learning based, fine tuning is not applicable for them. They do not take misspelled word context into consideration while correcting a word. As a result, their performance is poor, especially on the Bangla real error dataset (see Table 6). BSpell outperforms these Bangla SCs by a wide margin.

| Spell Checker | Synthetic Error (Prothom-Alo) ACC / F1 | Real-Error (No Fine Tune) ACC / F1 |
|---|---|---|
| Phonetic | 61.2% / 0.582 | 43.5% / 0.401 |
| Clustering | 52.3% / 0.501 | 44.2% / 0.412 |
| BSpell | 97.6% / 0.971 | 87.8% / 0.873 |

Table 6: Existing Bangla spell checkers vs BSpell.

…English words. 20% of the words of the training set have been converted to spelling errors based on this confusion set. The authors created the BEA-60K test set from the BEA-2019 shared task, consisting of natural English spelling errors. The best correction rate achieved by the authors was around 80% using an LSTM based ELMo model, whereas BSpell has achieved a correction rate of 86.2%. We have also experimented with the BERT_Base model on this test set, where we have used byte pair encoding as the word representation. BERT_Base has achieved an error correction rate of 85.6%. It is clear that BSpell and BERT_Base do not have that much difference in performance when it comes to English compared to Bangla and Hindi.

5.8 Effectiveness of SemanticNet
…ing algorithm with improved initial center. In 2009 Second International Workshop on Knowledge Discovery and Data Mining, pages 790–792. IEEE.

Xingyi Cheng, Weidi Xu, Kunlong Chen, Shaohua Jiang, Feng Wang, Taifeng Wang, Wei Chu, and Yuan Qi. 2020. Spellgcn: Incorporating phonological and visual similarities into language models for chinese spelling check. arXiv preprint arXiv:2004.14166.

Shamil Chollampatt and Hwee Tou Ng. 2017. Connecting the dots: Towards human-level grammatical error correction. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 327–333.

Shamil Chollampatt and Hwee Tou Ng. 2018. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Pravallika Etoori, Manoj Chinnakotla, and Radhika Mamidi. 2018. Automatic spelling correction for resource-scarce languages using deep learning. In Proceedings of ACL 2018, Student Research Workshop, pages 146–152, Melbourne, Australia. Association for Computational Linguistics.

Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu. 2019. Efficient training of bert by progressively stacking. In International Conference on Machine Learning, pages 2337–2346. PMLR.

Yuzhong Hong, Xianguo Yu, Neng He, Nan Liu, and Junhui Liu. 2019. Faspell: A fast, adaptable, simple, powerful chinese spell checker based on dae-decoder paradigm. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 160–169.

Sadidul Islam, Mst Farhana Sarkar, Towhid Hussain, Md Mehedi Hasan, Dewan Md Farid, and Swakkhar Shatabda. 2018. Bangla sentence correction using deep neural network based sequence to sequence learning. In 2018 21st International Conference of Computer and Information Technology (ICCIT), pages 1–6. IEEE.

Sai Muralidhar Jayanthi, Danish Pruthi, and Graham Neubig. 2020. Neuspell: A neural spelling correction toolkit. arXiv preprint arXiv:2010.11085.

Nur Hossain Khan, Gonesh Chandra Saha, Bappa Sarker, and Md Habibur Rahman. 2014. Checking the correctness of bangla words using n-gram. International Journal of Computer Application, 89(11).

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Prianka Mandal and BM Mainul Hossain. 2017. Clustering-based bangla spell checker. In 2017 IEEE International Conference on Imaging, Vision & Pattern Recognition (icIVPR), pages 1–6. IEEE.

Jan Noyes. 1983. The qwerty keyboard: A review. International Journal of Man-Machine Studies, 18(3):265–281.

Sourav Saha, Faria Tabassum, Kowshik Saha, and Marjana Akter. 2019. Bangla spell checker and suggestion generator. Ph.D. thesis, United International University.

Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Jonathon Shlens. 2014. A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100.

Md Habibur Rahman Sifat, Chowdhury Rafeed Rahman, Mohammad Rafsan, and Hasibur Rahman. 2020. Synthetic error dataset generation mimicking bengali writing pattern. In 2020 IEEE Region 10 Symposium (TENSYMP), pages 1363–1366. IEEE.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Felix Stahlberg and Shankar Kumar. 2021. Synthetic data generation for grammatical error correction with tagged corruption models. arXiv preprint arXiv:2105.13318.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2020. Ernie 2.0: A continual pre-training framework for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8968–8975.

Naushad UzZaman and Mumit Khan. 2004. A bangla phonetic encoding for better spelling suggesions. Technical report, BRAC University.

Naushad UzZaman and Mumit Khan. 2005. A double metaphone encoding for approximate name searching and matching in bangla.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Dingmin Wang, Yi Tay, and Li Zhong. 2019. Confusionset-guided pointer networks for Chinese spelling check. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5780–5785, Florence, Italy. Association for Computational Linguistics.

Jinhua Xiong, Qiao Zhang, Shuiyuan Zhang, Jianpeng Hou, and Xueqi Cheng. 2015. Hanspeller: a unified framework for chinese spelling correction. In International Journal of Computational Linguistics & Chinese Language Processing, Volume 20, Number 1, June 2015 - Special Issue on Chinese as a Foreign Language.

Shaohua Zhang, Haoran Huang, Jicong Liu, and Hang Li. 2020. Spelling error correction with soft-masked bert. arXiv preprint arXiv:2005.07421.