BSpell: A CNN-Blended BERT Based Bangla Spell Checker

Chowdhury Rafeed Rahman
National University of Singapore
e0823054@u.nus.edu

MD. Hasibur Rahman
United International University

Samiha Zakir and Mohammed Rafsan
University of Texas Rio Grande Valley

Mohammed Eunus Ali
Bangladesh University of Engineering and Technology

Abstract

Bangla typing is mostly performed using an English keyboard and can be highly erroneous due to the presence of compound and similarly pronounced letters. Spelling correction of a misspelled word requires understanding of the word typing pattern as well as the context of the word usage. A specialized BERT model named BSpell has been proposed in this paper, targeted towards word for word correction at sentence level. BSpell contains an end-to-end trainable CNN sub-model named SemanticNet along with a specialized auxiliary loss. This allows BSpell to specialize in highly inflected Bangla vocabulary in the presence of spelling errors. Furthermore, a hybrid pretraining scheme has been proposed for BSpell that combines word level and character level masking. Comparison on two Bangla and one Hindi spelling correction datasets shows the superiority of our proposed approach. BSpell is available as a Bangla spell checking tool via GitHub: https://github.com/Hasiburshanto/Bangla-Spell-Checker.

1 Introduction

Bangla is the native language of 228 million people, which makes it the sixth most spoken language in the world¹. This Sanskrit originated language has 11 vowels, 39 consonants, 11 modified vowels and 170 compound characters (Sifat et al., 2020). There is a vast difference between Bangla grapheme representation and phonetic utterance for many commonly used words. As a result, fast typing of Bangla yields frequent spelling mistakes. Almost all Bangla native speakers type using an English QWERTY layout keyboard (Noyes, 1983), which makes it difficult to type Bangla compound characters, phonetically similar single characters and similarly pronounced modified vowels correctly. Thus Bangla typing speed, if error-free typing is desired, is slow. An accurate spell checker (SC) can be a solution to this problem.

Existing Bangla SCs include phonetic rule (UzZaman and Khan, 2004, 2005) and clustering based methods (Mandal and Hossain, 2017). These methods do not take misspelled word context into consideration. Another N-gram based Bangla SC (Khan et al., 2014) takes only short range previous context into consideration. Recent state-of-the-art (SOTA) spell checkers have been developed for the Chinese language, where a character level confusion set (similar characters) guided sequence to sequence (seq2seq) model has been proposed by Wang et al. (2019). Another research used a similarity mapping graph convolutional network in order to guide BERT based character by character parallel correction (Cheng et al., 2020). Both these methods require external knowledge and assumptions about confusing character pairs existing in the language. The most recent Chinese SC offers an assumption free BERT architecture where error detection network based soft-masking is included (Zhang et al., 2020). This model takes all N characters of a sentence as input and produces the correct version of these N characters as output in a parallel manner.

Incorrect: পরিকা (প+র+ি+ক+া)    Correct: পরীক্ষা (প+র+ী+ক+্+ষ+া): Exam
Incorrect: বিশশ (ব+ি+শ+শ)    Correct: বিশ্ব (ব+ি+শ+্+ব): World
Incorrect: ভাদর (ভ+া+দ+র)    Correct: ভাদ্র (ভ+া+দ+্+র): month name

Figure 1: Heterogeneous character number between error word and corresponding correctly spelled word

One of the limitations in developing a Bangla SC using the SOTA BERT based implementation (Zhang et al., 2020) is that the number of input and output characters in BERT has to be exactly the same. Such a scheme is only capable of correcting substitution type errors. As compound characters are common in Bangla words, an error made due to the substitution of such characters also changes the word length (see the table in Figure 1). So, we introduce word level prediction in our proposed BERT based model.

¹ https://www.babbel.com/en/magazine/the-10-most-spoken-languages-in-the-world
Correct: সৈনিক ঘোড়া চড়ে যুদ্ধে গেল। (Soldier went to war riding a horse)    Incorrect: সৈনিক ঘোরা চড়ে যুদ্ধে গেল। (Soldier went to war riding a visit)
Correct: আসামী দোষ স্বীকার করল। (The criminal confessed crime)    Incorrect: আসামী দোষ শিকার করল। (The criminal hunted crime)
Correct: কাল আমাদের বার্ষিক পরীক্ষা। (Tomorrow is our final exam)    Incorrect: কাল আমাদের বার্ষিক পরিখা। (Tomorrow is our final trench)

Figure 2: Example words that are correctly spelled accidentally, but are context-wise incorrect.

The table shown in Figure 2 illustrates the importance of context in Bangla SC. Although the red marked words of this figure are the misspelled versions of the corresponding green marked correct words, these red words are valid Bangla words. But if we check these red words based on sentence semantic context, we can realize that these words have been produced accidentally because of spelling error. An effective SC has to consider the word pattern, its prior context and its post context.

Misspelled: গরাম ক্রিশির অরর নিরভরশিল
Correct: গ্রাম কৃষির ওপর নির্ভরশীল
Meaning: Villages are dependent on agriculture

Misspelled: গরাম    Correct: গ্রাম (village)    Context: কৃষির
Misspelled: ক্রিশির    Correct: কৃষির (Agriculture)    Context: গ্রাম, নির্ভরশীল
Misspelled: অরর    Correct: ওপর (on)    Context: নির্ভরশীল
Misspelled: নিরভরশিল    Correct: নির্ভরশীল (dependent)    Context: ওপর

Figure 3: Necessity of understanding existing erroneous words for spelling correction of misspelled words

Spelling errors often span multiple words in a sentence. Figure 3 provides an example where all four words have been misspelled. The correction of each word has context dependency on a few other words of the same sentence. The problem is that the words that form the correction context are also misspelled. The table in the figure shows the words to look at in order to correct each misspelled word. In the original sentence (colored in red), all these words that need to be looked at for context are misspelled. If a SC cannot understand the approximate underlying meaning of these misspelled words, then we lose all context for correcting each misspelled word, which is undesirable.

We propose a word level BERT (Devlin et al., 2018) based model, BSpell. This model is capable of learning prior and post context dependency through the use of the multi-head attention mechanism of stacked Transformer encoders (Vaswani et al., 2017). The model uses a CNN based learnable SemanticNet sub-model to capture the semantic meaning of both correct and misspelled words. BSpell also uses a specialized auxiliary loss to facilitate word level pattern learning and removal of the vanishing gradient problem. We introduce hybrid pretraining for BSpell to capture both context and word error patterns. We perform detailed evaluation on three error datasets that include a real life Bangla error dataset. Our evaluation includes detailed analysis of possible LSTM based SCs, SC variants of BERT and existing classic Bangla SCs.

2 Related Works

Several studies on Bangla SC development have been conducted in spite of Bangla being a low resource language. A phonetic encoding oriented Bangla word level SC based on the Soundex algorithm was proposed by UzZaman and Khan (2004). This encoding scheme was later modified to develop a Double Metaphone encoding based Bangla SC (UzZaman and Khan, 2005). They took into account major context-sensitive rules and consonant clusters while performing their encoding scheme. Another word level Bangla SC able to handle both typographical and phonetic errors was proposed by Mandal and Hossain (2017). An N-gram model was proposed by Khan et al. (2014) for checking sentence level Bangla word correctness. An encoder-decoder based seq2seq model was proposed by Islam et al. (2018) for the Bangla sentence correction task, which involved bad arrangement of words and missing words, though this work did not include incorrect spelling. A recent study has included Hindi and Telugu SC development, where mistakes are assumed to be made at character level (Etoori et al., 2018). They have used attention based encoder-decoder modeling as their approach.

SOTA research in this domain involves Chinese SCs, as Chinese is an error prone language due to its confusing word segmentation and its phonetically and visually similar but semantically different characters. A seq2seq model assisted by a pointer network was employed for character level spell checking, where the network is guided by an externally generated character confusion set (Wang et al., 2019). Another research incorporated phonological and visual similarity knowledge of Chinese characters into a BERT based SC model by utilizing a graph convolutional network (Cheng et al., 2020). A recent BERT based SC has taken advantage of a GRU (Gated Recurrent Unit) based soft masking mechanism and has achieved SOTA performance in Chinese character level SC in spite of not providing any external knowledge to the network (Zhang et al., 2020). Another external knowledge free approach, named FASPell, used a BERT based seq2seq model (Hong et al., 2019). HanSpeller++ is notable among initially implemented Chinese SCs (Xiong et al., 2015). It was a unified framework utilizing a hidden Markov model.
3 Our Approach

3.1 Problem Statement

Suppose an input sentence consists of n words – Word_1, Word_2, ..., Word_n. For each Word_i, we have to predict the right spelling if Word_i exists in the top-word list of our corpus. If Word_i is a rare word (a proper noun in most cases), we predict the UNK token, denoting that we do not make any correction to such words. For correcting a particular Word_i in a paragraph, we only consider other words of the same sentence for context information.

3.2 BSpell Architecture

Figure 4 shows the details of the BSpell architecture. Each input word of the sentence is passed through the SemanticNet sub-model. This sub-model returns a SemanticVec vector representation for each input word. These vectors are then passed onto two separate branches (main branch and secondary branch) simultaneously. The main branch is similar to the BERT_Base architecture (Gong et al., 2019). This branch provides us with the n correct words corresponding to the n input sentence words at its output side. The secondary branch consists of an output dense layer. This branch is used for the sole purpose of imposing an auxiliary loss to facilitate SemanticNet sub-model learning of misspelled word patterns.

3.2.1 SemanticNet Sub-Model

Correcting a particular word requires the understanding of other relevant words in the same sentence. Unfortunately, those relevant words may also be misspelled. As humans, we can understand the meaning of a word even if it is misspelled because of our deep understanding at word syllable level and our knowledge of usual spelling error patterns. We want our model to have a similar semantic level understanding of the words. We propose SemanticNet, a sequential 1D CNN sub-model that is employed at each individual word level with a view to learning intra word syllable patterns. Details of the individual word representation are shown in the bottom right corner of Figure 4. We represent each input word by a matrix (each character represented as a one hot vector). We apply global max pooling on the final convolution layer output feature matrix of SemanticNet, which gives us the SemanticVec vector representation of the input word. We get a similar SemanticVec representation from each of our input words by independently applying the same SemanticNet sub-model on each of their matrix representations.

3.2.2 BERT_Base as Main Branch

Each of the SemanticVec vector representations obtained from the input words is passed parallelly on to our first Transformer encoder. 12 such Transformer encoders are stacked on top of each other. Each Transformer employs multi-head attention, layer normalization and dense layer specific modification on each input vector. The attention mechanism applied on the word feature vectors in each Transformer layer helps the words of the input sentence interact with one another, extracting sentence context. We pass the final Transformer layer output vectors to a dense layer with Softmax activation applied on each vector in an independent manner. So, now we have n probability vectors from the n words of the input sentence. Each probability vector contains lenP values, where lenP is one more than the total number of top words considered (the additional word represents rare words). The top word corresponding to the index of the maximum probability value of the ith probability vector represents the correct word for Word_i of the input sentence.
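The paper does not include reference code for SemanticNet, so the following is a minimal PyTorch sketch of the per-word sub-model described above, not the released implementation. The framework, the one-hot character input (Section 4.3 instead mentions a size 40 character embedding) and the character vocabulary size (73, taken from Table 1) are assumptions made here for illustration, while the (filters, kernel) specification mirrors the values reported later in Section 4.3.

```python
import torch
import torch.nn as nn

class SemanticNet(nn.Module):
    """Sequential 1D CNN applied to one word at a time (illustrative sketch).

    Input : (batch, max_chars, char_vocab) one-hot character matrix of a word.
    Output: (batch, feat_dim) SemanticVec, obtained by global max pooling over
            the character axis of the last convolution's feature map.
    """

    def __init__(self, char_vocab=73,
                 conv_specs=((64, 2), (64, 3), (128, 3), (128, 3), (256, 4))):
        super().__init__()
        layers, in_ch = [], char_vocab
        for filters, kernel in conv_specs:   # (filters, kernel size) tuples, as in Section 4.3
            layers += [nn.Conv1d(in_ch, filters, kernel, padding=kernel // 2),
                       nn.BatchNorm1d(filters),
                       nn.ReLU()]
            in_ch = filters
        self.convs = nn.Sequential(*layers)
        self.out_dim = in_ch

    def forward(self, char_onehots):                         # (batch, max_chars, char_vocab)
        feats = self.convs(char_onehots.transpose(1, 2))     # Conv1d expects (batch, channels, length)
        semantic_vec = feats.max(dim=2).values               # global max pooling over characters
        return semantic_vec                                  # (batch, 256) SemanticVec
```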
[Figure 4 (architecture diagram): each input word is turned into a character embedding matrix and passed through the SemanticNet sequential 1D CNN sub-model (1D convolutions, global max pooling) to produce a SemanticVec; the per-word SemanticVecs feed 12 stacked Transformer encoders (BERT_Base) followed by a Softmax layer that produces the final word outputs, and, in parallel, a dense plus Softmax secondary branch that produces the auxiliary word outputs.]

Figure 4: BSpell architecture details
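As a companion to Figure 4, the sketch below shows one way the components described in Sections 3.2.1 and 3.2.2 could be wired together, reusing the hypothetical SemanticNet module sketched earlier. It is an illustration only: the number of attention heads, the feed-forward width and the use of PyTorch are assumptions, and softmax is deferred to the loss function.

```python
import torch
import torch.nn as nn

class BSpell(nn.Module):
    """Word level corrector sketch: one SemanticVec per word, 12 stacked Transformer
    encoders as the main branch, and an auxiliary head applied directly to the SemanticVecs."""

    def __init__(self, semantic_net, len_p, n_layers=12, n_heads=8):  # n_heads assumed, not stated in the paper
        super().__init__()
        d_model = semantic_net.out_dim
        self.semantic_net = semantic_net
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                                   dim_feedforward=4 * d_model,
                                                   dropout=0.3, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.main_head = nn.Linear(d_model, len_p)   # logits over lenP top words (softmax in the loss)
        self.aux_head = nn.Linear(d_model, len_p)    # secondary branch: no Transformer encoders in between

    def forward(self, char_onehots):
        # char_onehots: (batch, n_words, max_chars, char_vocab)
        b, n, c, v = char_onehots.shape
        semantic_vecs = self.semantic_net(char_onehots.view(b * n, c, v)).view(b, n, -1)
        main_logits = self.main_head(self.encoder(semantic_vecs))   # (batch, n_words, lenP)
        aux_logits = self.aux_head(semantic_vecs)                   # (batch, n_words, lenP)
        return main_logits, aux_logits
```

The secondary branch acts directly on the SemanticVecs, so that no Transformer encoder sits between SemanticNet and its auxiliary supervision, as Section 3.2.3 below explains.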

3.2.3 Auxiliary Loss in Secondary Branch

The vanishing gradient problem is a common phenomenon in deep neural networks, where the weights of the shallow layers are not updated sufficiently during backpropagation. With 12 Transformer encoders on top of the SemanticNet sub-model, the layers of this sub-model certainly lie in a shallow position. Although SemanticNet constitutes a small initial portion of BSpell, this portion is responsible for word pattern learning, an important task of SC. In order to eliminate the vanishing gradient problem of SemanticNet and to turn it into an effective pattern based word level spell checker, we introduce an auxiliary loss based secondary branch in BSpell. Each of the n SemanticVecs obtained from the n input words is passed parallelly on to a Softmax layer without any further modification. The outputs obtained from this branch are probability vectors similar to the main branch output. The total loss of BSpell can be expressed as: L_Total = L_Final + λ × L_Auxiliary. We want our final loss to have a greater impact on the model weight update, as it is associated with the final prediction made by BSpell. Hence, we impose the constraint 0 < λ < 1. This secondary branch of BSpell does not have any Transformer encoders through which the input words can interact to produce context information. The prediction made from this branch depends solely on the misspelled word pattern extracted by SemanticNet. This enables SemanticNet to learn a more meaningful word representation.
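A minimal sketch of the combined objective above, assuming standard cross-entropy at every word position and the λ = 0.3 weight reported later in Section 4.3; this illustrates the formula, it is not the authors' training code.

```python
import torch
import torch.nn.functional as F

def bspell_loss(main_logits, aux_logits, target_word_ids, lam=0.3):
    """L_Total = L_Final + lam * L_Auxiliary, with the constraint 0 < lam < 1.

    main_logits, aux_logits: (batch, n_words, lenP) logits from the two branches
    target_word_ids:         (batch, n_words) indices into the top-word list (UNK for rare words)
    """
    loss_final = F.cross_entropy(main_logits.flatten(0, 1), target_word_ids.flatten())
    loss_aux = F.cross_entropy(aux_logits.flatten(0, 1), target_word_ids.flatten())
    return loss_final + lam * loss_aux
```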
3.3 BERT Hybrid Pretraining

In contemporary BERT pretraining methods, each input word Word_i may be kept intact or may be replaced by a default mask word in a probabilistic manner (Devlin et al., 2018; Liu et al., 2019). BERT has to predict the masked words. Mistakes on the BERT side contribute to the loss value, driving backpropagation based weight updates. In this process, BERT learns to fill in the gaps, which in turn teaches the model language context.

[Figure 5 (diagram): a five word input sentence in which some words undergo character masking and one word undergoes word masking before being passed to the BSpell (SemanticNet-blended BERT) model, which must predict the original five words.]

Figure 5: BERT hybrid pretraining

Sun et al. (2020) proposed incremental ways of pretraining the model for new NLP tasks. We take a more task specific approach for masking. In SC, recognizing noisy word patterns is important. But there is no provision for that in contemporary pretraining schemes, and so we propose hybrid masking (see Figure 5). Among the n input words in a sentence, we randomly replace n_W words with a mask word Mask_W. Among the remaining n − n_W words, we choose n_C words for character masking. During character masking, we choose m_C characters at random from a word having m characters and replace each of them with a mask character Mask_C. Such masked characters introduce noise in words and help BERT to understand the probable semantic meaning of noisy/misspelled words.

4 Experimental Setup

4.1 Implemented Pretraining Schemes

We have experimented with three types of masking based pretraining schemes. During word masking, we randomly select 15% of the words of a sentence and replace those with a fixed mask word. During character masking, we randomly select 50% of the words of a sentence. For each selected word, we randomly mask 30% of its characters by replacing each of them with a special mask character. Finally, during hybrid masking, we randomly select 15% of the words of a sentence and replace them with a fixed mask word. We randomly select 40% of the words from the remaining words. For these selected words, we randomly mask 25% of their characters.

4.2 Dataset Specification

We have used one Bangla and one Hindi corpus with over 5 million (5 M) sentences for BERT pretraining (see Table 1). The Bangla pretraining corpus consists of Prothom Alo² articles dated from 2014-2017 and BDnews24³ articles dated from 2015-2017. The Hindi pretraining corpus consists of the Hindi Oscar Corpus⁴, preprocessed Wikipedia articles⁵, the HindiEnCorp05 dataset⁶ and WMT Hindi News Crawl data⁷ (all of these are publicly available corpora). We have used the Prothom-Alo 2017 online newspaper dataset for Bangla SC training and validation purposes. The errors in this corpus have been produced synthetically using the probabilistic algorithm described by Sifat et al. (2020). We further validate our baselines and proposed methods on a Hindi open source SC dataset, namely ToolsForIL (Etoori et al., 2018). For the real error dataset, we have collected a total of 6300 sentences from the Nayadiganta⁸ online newspaper. We have then distributed the dataset among ten participants. They have typed (at regular speed) each correct sentence using an English QWERTY keyboard, producing natural spelling errors. It has taken 40 days to finish the labeling. Top words have been taken such that they cover at least 95% of the corresponding corpus.

4.3 BSpell Architecture Hyperparameters

The SemanticNet sub-model of BSpell consists of a character level embedding layer producing a size 40 vector from each character, then 5 consecutive layers each consisting of a 1D convolution (with batch normalization and ReLU activation between each pair of convolution layers) and finally a 1D global max pooling in order to obtain the SemanticVec representation of each input word. The five 1D convolution layers consist of (64, 2), (64, 3), (128, 3), (128, 3), (256, 4) convolutions, respectively. The first and second element of each tuple denote the number of convolution filters and the kernel size, respectively. We provide a weight of 0.3 (the λ value of the loss function) to the auxiliary loss. The main branch of BSpell is similar to BERT_Base (Gong et al., 2019) in terms of stacking 12 Transformer encoders. Attention outputs from each Transformer are passed through a dropout layer (Srivastava et al., 2014) with a dropout rate of 0.3 and then layer normalized (Ba et al., 2016). We use the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.001 for our model weight updates. We clip our gradient values and keep them below 5.0 to avoid the exploding gradient problem.

² https://www.prothomalo.com/
³ https://bangla.bdnews24.com/
⁴ https://www.kaggle.com/abhishek/hindi-oscar-corpus
⁵ https://www.kaggle.com/disisbig/hindi-wikipedia-articles-172k
⁶ http://hdl.handle.net/11858/00-097C-0000-0023-625F-0
⁷ https://www.aclweb.org/anthology/W19-5301
⁸ https://www.dailynayadiganta.com/
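The following plain Python sketch illustrates the hybrid masking scheme described in Sections 3.3 and 4.1 above, with the 15%/40%/25% rates quoted there. The mask symbols MASK_W and MASK_C are placeholders, since the actual tokens used by the authors are not given in the text.

```python
import random

MASK_W = "[MASK_W]"   # placeholder word-level mask symbol (assumed)
MASK_C = "#"          # placeholder character-level mask symbol (assumed)

def hybrid_mask(words, word_rate=0.15, char_word_rate=0.40, char_rate=0.25, rng=random):
    """Return a noisy copy of `words` following the hybrid masking rates of Section 4.1."""
    words = list(words)
    n = len(words)
    word_masked = set(rng.sample(range(n), k=max(1, int(word_rate * n))))
    remaining = [i for i in range(n) if i not in word_masked]
    char_masked = set(rng.sample(remaining, k=int(char_word_rate * len(remaining))))

    noisy = []
    for i, w in enumerate(words):
        if i in word_masked:
            noisy.append(MASK_W)                               # whole word replaced by Mask_W
        elif i in char_masked and len(w) > 0:
            k = max(1, int(char_rate * len(w)))
            positions = set(rng.sample(range(len(w)), k=k))    # characters replaced by Mask_C
            noisy.append("".join(MASK_C if j in positions else ch for j, ch in enumerate(w)))
        else:
            noisy.append(w)
    return noisy
```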
Dataset                                    | Unique Words | Unique Chars | Top Words | Train Samples | Validation Samples | Unique Error Words | Error Word Percentage
Prothom-Alo Bangla Synthetic Error         | 262 K        | 73           | 35 K      | 1 M           | 200 K              | 450 K              | 52%
Bangla Real Error                          | 14.5 K       | 73           | _         | 4.3 K         | 2 K                | 10 K               | 36%
Bangla Pretrain Corpus                     | 513 K        | 73           | 40 K      | 5.5 M         | _                  | _                  | _
Hindi Synthetic Error Corpus (ToolsForIL)  | 20.5 K       | 77           | 15 K      | 75 K          | 16 K               | 5 K                | 10%
Hindi Pretrain Corpus                      | 370 K        | 77           | 40 K      | 5.5 M         | _                  | _                  | _

Table 1: Dataset specification details

5 Results and Discussion

5.1 Training and Validation Details

In the case of Bangla SC, we randomly initialize the weights of model M. We use our large Bangla pretrain corpus for hybrid pretraining and get the pretrained model M_pre. Next we split our benchmark synthetic spelling error dataset (Prothom-Alo) into an 80%-20% training-validation set. We fine tune M_pre using the 80% training portion (obtaining fine tuned model M_fine) and report performance on the remaining 20% validation portion. We use the Bangla real spelling error dataset in two ways: (1) we do not fine tune M_fine on any part of this data and use the entire dataset as an independent test set (result reported under the title real error (no fine tune)); (2) we split this real error dataset into 80%-20% training-validation and fine tune M_fine further using the 80% portion, then validate on the remaining 20% (result reported under the title real error (fine tuned)). In the case of Hindi, the first two steps (pretraining and fine tuning) are the same. We have not constructed any real life spelling error dataset for Hindi. So, results are reported on the 20% held out portion of the benchmark dataset.

5.2 BSpell vs Contemporary BERT Variants

We start with BERT Seq2seq, where the encoder and decoder portions consist of 12 stacked Transformers (Devlin et al., 2018). Predictions are made at character level. A similar architecture has been used in FASPell (Hong et al., 2019) for Chinese SC. A word is considered wrong if even one of its characters is predicted incorrectly. Hence character level seq2seq modeling achieves poor results (see Table 2). Moreover, in most cases during sentence level spell checking, the correct spelling of the ith word of the input sentence has to be the ith word in the output sentence as well. Such a constraint is difficult to enforce with this architecture design. BERT Base, consisting of stacked Transformer encoders, has two differences from the design proposed by Cheng et al. (2020): (i) we make predictions at word level instead of character level; (ii) we do not incorporate any external knowledge about Bangla SC, since such knowledge is not well established in the field. This approach achieves good performance in all four cases. Soft Masked BERT learns to apply specialized synthetic masking on error prone words in order to push the error correction performance of BERT Base further. The error prone words are detected using a GRU sub-model and the whole architecture is trained end to end. Although Zhang et al. (2020) implemented this architecture to make corrections at character level, our implementation does everything at word level. We have used the popular FastText (Athiwaratkun et al., 2018) word representation for both BERT Base and Soft Masked BERT. BSpell shows a decent performance improvement in all cases.
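The tables that follow report word level ACC and F1. The exact metric definitions are not spelled out in the text, so the small helper below shows one plausible reading of the ACC column, purely for illustration: the fraction of word positions whose predicted word matches the reference.

```python
def word_level_accuracy(pred_sentences, gold_sentences):
    """Fraction of word positions predicted correctly (assumed reading of the ACC column)."""
    total = correct = 0
    for pred, gold in zip(pred_sentences, gold_sentences):
        for p, g in zip(pred, gold):
            total += 1
            correct += int(p == g)
    return correct / total if total else 0.0
```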
Spell Checker Architecture | Synthetic Error (Prothom-Alo) ACC / F1 | Real-Error (No Fine Tune) ACC / F1 | Real-Error (Fine Tuned) ACC / F1 | Synthetic Error (Hindi) ACC / F1
BERT Seq2seq               | 31.6% / 0.305 | 24.5% / 0.224 | 29.3% / 0.278 | 22.8% / 0.209
BERT Base                  | 91.1% / 0.902 | 83% / 0.823   | 87.6% / 0.855 | 93.8% / 0.923
Soft Masked BERT           | 92% / 0.919   | 84.2% / 0.832 | 88.1% / 0.862 | 94% / 0.933
BSpell                     | 94.7% / 0.934 | 86.1% / 0.859 | 90.1% / 0.898 | 96.2% / 0.96

Table 2: Comparing BERT based variants. Typical word masking based pretraining has been used on all these variants. Real-Error (Fine Tuned) denotes fine tuning of the Bangla synthetic error dataset trained model on the real error dataset, while Real-Error (No Fine Tune) means directly validating the synthetic error dataset trained model on the real error dataset without any further fine tuning.

Pretraining Scheme | Synthetic Error (Prothom-Alo) ACC / F1 | Real-Error (No Fine Tune) ACC / F1 | Real-Error (Fine Tuned) ACC / F1 | Synthetic Error (Hindi) ACC / F1
Word Masking       | 94.7% / 0.934 | 86.1% / 0.859 | 90.1% / 0.898 | 96.2% / 0.96
Character Masking  | 95.6% / 0.952 | 85.3% / 0.851 | 89.2% / 0.889 | 96.4% / 0.963
Hybrid Masking     | 97.6% / 0.971 | 87.8% / 0.873 | 91.5% / 0.911 | 97.2% / 0.97

Table 3: Comparing BSpell exposed to various pretraining schemes

5.3 Comparing BSpell Pretraining Schemes

We have implemented three different pretraining schemes (details provided in Subsection 4.1) on BSpell before fine tuning on the spell checker dataset. Word masking teaches BSpell the context of a language through a fill in the gaps sort of approach. SC is not all about filling in the gaps. It is also about what the writer wants to say, i.e. being able to predict a word even if some of its characters are blank (masked). Character masking takes a more drastic approach by completely eliminating the fill in the gap task. This approach masks a few of the characters residing in some of the input words of the sentence and asks BSpell to predict these noisy words' original correct versions. The lack of context in such a pretraining scheme has a negative effect on performance in the real error dataset experiments, where harsh errors exist and context is the only feasible way of correcting such errors (see Table 3). Hybrid masking focuses both on filling in word gaps and on filling in character gaps through prediction of the correct word, and helps BSpell achieve SOTA performance.

5.4 BSpell vs Possible LSTM Variants

BiLSTM is a many to many bidirectional LSTM (two layers) that takes in all n words of a sentence at once and predicts their correct versions as output (Schuster and Paliwal, 1997). During SC, BiLSTM takes both previous and post context into consideration besides the writing pattern of each word and shows reasonable performance (see Table 4). In Stacked BiLSTM, we stack twelve many to many bidirectional LSTMs instead of just two. We see marginal improvement in SC performance in spite of such a large increase in parameter count. The Attn_Seq2seq LSTM model utilizes an attention mechanism at the decoder side (Bahdanau et al., 2014). This model takes in misspelled sentence characters as input and provides the correct sequence of characters as output (Etoori et al., 2018). Due to word level spelling correction evaluation, this model faces the same problems as the BERT Seq2seq model discussed in Subsection 5.2. The proposed BSpell outperforms these models by a large margin.

5.5 Ablation Study

BSpell has three unique features: (1) the secondary branch with auxiliary loss (it is possible to remove this branch), (2) the 1D CNN based SemanticNet sub-model (it can be replaced by simple Byte Pair Encoding (BPE) (Vaswani et al., 2017)) and (3) hybrid pretraining (it can be replaced by word masking based pretraining). Table 5 demonstrates the results we obtain after removing any one of these features. In all cases, the results show a downward trend compared to the original architecture.
Spell Checker Architecture | Synthetic Error (Prothom-Alo) ACC / F1 | Real-Error (No Fine Tune) ACC / F1 | Real-Error (Fine Tuned) ACC / F1 | Synthetic Error (Hindi) ACC / F1
BiLSTM                     | 81.9% / 0.818 | 78.3% / 0.781 | 81.1% / 0.809 | 81.2% / 0.809
Stacked BiLSTM             | 83.5% / 0.832 | 80.1% / 0.80  | 82.4% / 0.822 | 82.7% / 0.824
Attn_Seq2seq (Char)        | 20.5% / 0.178 | 15.4% / 0.129 | 17.3% / 0.152 | 22.7% / 0.216
BSpell                     | 97.6% / 0.971 | 87.8% / 0.873 | 91.5% / 0.911 | 97.2% / 0.97

Table 4: Comparing LSTM based variants with hybrid pretrained BSpell. FastText word representation has been used with the LSTM portion of each architecture.

BSpell Variant      | Synthetic Error (Prothom-Alo) ACC / F1 | Real-Error (No Fine Tune) ACC / F1 | Real-Error (Fine Tuned) ACC / F1 | Synthetic Error (Hindi) ACC / F1
Original            | 97.6% / 0.971 | 87.8% / 0.873 | 91.5% / 0.911 | 97.2% / 0.97
No Aux Loss         | 96.3% / 0.96  | 86.9% / 0.865 | 90.5% / 0.90  | 95.4% / 0.949
No SemanticNet      | 94.5% / 0.94  | 85.7% / 0.848 | 89.2% / 0.885 | 95.2% / 0.95
No Hybrid Pretrain  | 94.7% / 0.934 | 86.1% / 0.859 | 90.1% / 0.898 | 96.2% / 0.96

Table 5: Comparing BSpell with its variants created by removing one of its novel features

5.6 Existing Bangla Spell Checkers vs BSpell

The phonetic rule based SC takes a Bangla phonetic rule based hard coded approach (Saha et al., 2019), where a hybrid of the Soundex (UzZaman and Khan, 2004) and Metaphone (UzZaman and Khan, 2005) algorithms has been used. The clustering based SC, on the other hand, follows some predefined rules on word cluster formation, distance measurement and correct word suggestion (Mandal and Hossain, 2017). Since these two SCs are not learning based, fine tuning is not applicable to them. They do not take misspelled word context into consideration while correcting that word. As a result, their performance is poor, especially on the Bangla real error dataset (see Table 6). BSpell outperforms these Bangla SCs by a wide margin.

Spell Checker | Synthetic Error (Prothom-Alo) ACC / F1 | Real-Error (No Fine Tune) ACC / F1
Phonetic      | 61.2% / 0.582 | 43.5% / 0.401
Clustering    | 52.3% / 0.501 | 44.2% / 0.412
BSpell        | 97.6% / 0.971 | 87.8% / 0.873

Table 6: Existing Bangla spell checkers vs BSpell

5.7 Is BSpell Language Specific?

BSpell has originally been designed keeping in mind the unique characteristics of Sanskrit originated languages such as Bangla and Hindi. Here we see how this model performs on English, which is very different from Bangla in terms of structure. We experiment on an English spelling error dataset published by Jayanthi et al. (2020). The training set consists of 1.6 million sentences. The authors created a confusion set consisting of 109K misspelled-correct word pairs for 17K popular English words. 20% of the words of the training set have been converted to spelling errors based on this confusion set. The authors created the BEA-60K test set from the BEA-2019 shared task, consisting of natural English spelling errors. The best correction rate achieved by the authors was around 80% using the LSTM based ELMo model, whereas BSpell has achieved a correction rate of 86.2%. We have also experimented with the BERT_Base model on this test set, where we have used byte pair encoding as the word representation. BERT_Base has achieved an error correction rate of 85.6%. It is clear that BSpell and BERT_Base do not have that much difference in performance when it comes to English, compared to Bangla and Hindi.

5.8 Effectiveness of SemanticNet

The main motivation behind the inclusion of SemanticNet in BSpell is to obtain vector representations of error words that are as close as possible to those of their corresponding correct words. We take 10 frequently occurring Bangla words and collect three real life error variations of each of these words. We produce the SemanticVec representation of all 40 of these words using SemanticNet. We use principal component analysis (PCA) (Shlens, 2014) on each of these SemanticVecs and plot them in two dimensions. Finally, we apply the K-Means clustering algorithm using careful initialization with K = 10 (Chen and Xia, 2009). Figure 6 shows the 10 clusters obtained from this algorithm. Each cluster consists of a popular word and its three error variations. In all cases, the correct word and its three error versions are so close in the plot that they almost form a single point.

[Figure 6 (scatter plot): two dimensional PCA projection of the SemanticVec representations of 10 popular words (W1-W10) and their error variants; each word and its error versions collapse into a tight cluster.]

Figure 6: Visualizing SemanticVec representation of 10 popular words with their error variants
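A minimal scikit-learn sketch of the visualization procedure described above. Here `semantic_vecs` stands for the 40 SemanticVec vectors assumed to have been extracted beforehand (for example with the hypothetical SemanticNet module sketched earlier), and default PCA/K-Means settings are used where the paper does not specify them.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# semantic_vecs: (40, feat_dim) array, 10 popular words x (1 correct + 3 error variants)
semantic_vecs = np.load("semantic_vecs.npy")   # assumed pre-extracted vectors

points = PCA(n_components=2).fit_transform(semantic_vecs)   # project to 2D
# the paper cites a careful initialization (Chen and Xia, 2009); k-means++ defaults are used here
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(points)

plt.scatter(points[:, 0], points[:, 1], c=labels)
plt.title("SemanticVec representations of 10 words and their error variants")
plt.show()
```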
6 Conclusion

In this paper, we have proposed a SC named BSpell for the Bangla and Hindi languages. BSpell uses the SemanticVec representation of input misspelled words and a specialized auxiliary loss for the enhancement of spelling correction performance. The model exploits the concept of hybrid masking based pretraining. We have also investigated the limitations of existing Bangla SCs as well as other SOTA SCs proposed for high resource languages. BSpell has two main limitations: (a) it cannot handle accidental merge or split of words and (b) it cannot correct misspelled rare words. A potential research direction can be to eradicate these limitations by designing models that can perform prediction at sub-word level, which includes white space characters and punctuation marks.

7 Limitations

The BSpell model provides word for word correction, i.e., the number of input words and the number of output words have to be exactly the same. Unfortunately, during accidental word merging or word splitting, the number of input and output words differ, and so in such cases BSpell will fail to resolve such errors. This type of error is more common in the Chinese language. The advantage for us is that this type of error is rare in Bangla and Hindi, as the words of these languages are clearly spaced in sentences. So, people will rarely perform accidental merge or split of words. Another limitation is that BSpell has been trained to correct only the top Bangla and Hindi words that cover 95% of the entire corpus. As a result, this spell checker will face problems while correcting spelling errors in rare words. For such rare words, BSpell simply provides UNK as output, which means that it is not sure what to do with these words. An advantage here is that most of these rare words are some form of proper nouns, which should not be corrected and should ideally be left alone as they are. For example, someone may have an uncommon name. We do not want our model to correct that person's name to some commonly used name.

An immediate research direction is to overcome the limitations of the proposed method. A straightforward way of dealing with the word merge, word split and rare word correction problems is to model spelling errors at character level (a sequence-to-sequence type approach). We have made this trivial attempt and have failed miserably (see the performance reported in the first row of Table 2). Solving these problems while maintaining the current spelling correction performance of BSpell can be a challenge. Another interesting future direction is to investigate a personalized Bangla and Hindi spell checker which has the ability to take user personal preference and writing behaviour into account. The main challenge here is to effectively utilize user provided data that must be collected in an online setting. Recently, deep learning based automatic grammatical error correction has gained a lot of attention for the English language (Chollampatt and Ng, 2017, 2018; Stahlberg and Kumar, 2021). SOTA grammar correction models developed for English can be trained and tested on Bangla and Hindi spell checking tasks as part of future research effort. Such benchmarking studies can play a vital role in pushing the boundaries of low resource language correction automation.

References

Ben Athiwaratkun, Andrew Gordon Wilson, and Anima Anandkumar. 2018. Probabilistic fasttext for multi-sense word embeddings. arXiv preprint arXiv:1806.02901.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Zhang Chen and Shixiong Xia. 2009. K-means clustering algorithm with improved initial center. In 2009 Second International Workshop on Knowledge Discovery and Data Mining, pages 790–792. IEEE.
Xingyi Cheng, Weidi Xu, Kunlong Chen, Shaohua Jiang, Feng Wang, Taifeng Wang, Wei Chu, and Yuan Qi. 2020. Spellgcn: Incorporating phonological and visual similarities into language models for chinese spelling check. arXiv preprint arXiv:2004.14166.

Shamil Chollampatt and Hwee Tou Ng. 2017. Connecting the dots: Towards human-level grammatical error correction. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 327–333.

Shamil Chollampatt and Hwee Tou Ng. 2018. A multilayer convolutional encoder-decoder neural network for grammatical error correction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Pravallika Etoori, Manoj Chinnakotla, and Radhika Mamidi. 2018. Automatic spelling correction for resource-scarce languages using deep learning. In Proceedings of ACL 2018, Student Research Workshop, pages 146–152, Melbourne, Australia. Association for Computational Linguistics.

Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, and Tieyan Liu. 2019. Efficient training of bert by progressively stacking. In International Conference on Machine Learning, pages 2337–2346. PMLR.

Yuzhong Hong, Xianguo Yu, Neng He, Nan Liu, and Junhui Liu. 2019. Faspell: A fast, adaptable, simple, powerful chinese spell checker based on dae-decoder paradigm. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 160–169.

Sadidul Islam, Mst Farhana Sarkar, Towhid Hussain, Md Mehedi Hasan, Dewan Md Farid, and Swakkhar Shatabda. 2018. Bangla sentence correction using deep neural network based sequence to sequence learning. In 2018 21st International Conference of Computer and Information Technology (ICCIT), pages 1–6. IEEE.

Sai Muralidhar Jayanthi, Danish Pruthi, and Graham Neubig. 2020. Neuspell: A neural spelling correction toolkit. arXiv preprint arXiv:2010.11085.

Nur Hossain Khan, Gonesh Chandra Saha, Bappa Sarker, and Md Habibur Rahman. 2014. Checking the correctness of bangla words using n-gram. International Journal of Computer Application, 89(11).

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Prianka Mandal and BM Mainul Hossain. 2017. Clustering-based bangla spell checker. In 2017 IEEE International Conference on Imaging, Vision & Pattern Recognition (icIVPR), pages 1–6. IEEE.

Jan Noyes. 1983. The qwerty keyboard: A review. International Journal of Man-Machine Studies, 18(3):265–281.

Sourav Saha, Faria Tabassum, Kowshik Saha, and Marjana Akter. 2019. Bangla spell checker and suggestion generator. Ph.D. thesis, United International University.

Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.

Jonathon Shlens. 2014. A tutorial on principal component analysis. arXiv preprint arXiv:1404.1100.

Md Habibur Rahman Sifat, Chowdhury Rafeed Rahman, Mohammad Rafsan, and Hasibur Rahman. 2020. Synthetic error dataset generation mimicking bengali writing pattern. In 2020 IEEE Region 10 Symposium (TENSYMP), pages 1363–1366. IEEE.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.

Felix Stahlberg and Shankar Kumar. 2021. Synthetic data generation for grammatical error correction with tagged corruption models. arXiv preprint arXiv:2105.13318.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Hao Tian, Hua Wu, and Haifeng Wang. 2020. Ernie 2.0: A continual pre-training framework for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8968–8975.

Naushad UzZaman and Mumit Khan. 2004. A bangla phonetic encoding for better spelling suggesions. Technical report, BRAC University.

Naushad UzZaman and Mumit Khan. 2005. A double metaphone encoding for approximate name searching and matching in bangla.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
Dingmin Wang, Yi Tay, and Li Zhong. 2019. Confusionset-guided pointer networks for Chinese spelling check. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5780–5785, Florence, Italy. Association for Computational Linguistics.

Jinhua Xiong, Qiao Zhang, Shuiyuan Zhang, Jianpeng Hou, and Xueqi Cheng. 2015. Hanspeller: a unified framework for chinese spelling correction. In International Journal of Computational Linguistics & Chinese Language Processing, Volume 20, Number 1, June 2015 - Special Issue on Chinese as a Foreign Language.

Shaohua Zhang, Haoran Huang, Jicong Liu, and Hang Li. 2020. Spelling error correction with soft-masked bert. arXiv preprint arXiv:2005.07421.
