work based on BERT. The detection network predicts the probabilities of errors and the correction network predicts the probabilities of error corrections, while the former passes its prediction results to the latter using soft masking.

More specifically, our method first creates an embedding for each character in the input sentence, referred to as the input embedding. Next, it takes the sequence of embeddings as input and outputs the probabilities of errors for the sequence of characters (embeddings) using the detection network. After that, it calculates the weighted sum of the input embeddings and [MASK] embeddings, weighted by the error probabilities. The calculated embeddings mask the likely errors in the sequence in a soft way. Then, our method takes the sequence of soft-masked embeddings as input and outputs the probabilities of error corrections using the correction network, which is a BERT model whose final layer consists of a softmax function for all characters. There is also a residual connection between the input embeddings and the embeddings at the final layer. Next, we describe the details of the model.
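To make the soft-masking operation and the correction head concrete, here is a minimal sketch in PyTorch. It only illustrates the computation described above and is not the authors' implementation; the names soft_mask, correction_logits, e_input, p_err, e_mask, and w_out are ours.

```python
# Illustrative sketch of soft masking and the correction head (not the authors' code).
# e_input: input embeddings, shape (seq_len, emb_dim)
# p_err:   per-character error probabilities from the detection network, shape (seq_len,)
# e_mask:  embedding of the [MASK] token, shape (emb_dim,)
import torch

def soft_mask(e_input: torch.Tensor, p_err: torch.Tensor, e_mask: torch.Tensor) -> torch.Tensor:
    """Blend each input embedding with the [MASK] embedding, weighted by its error probability."""
    p = p_err.unsqueeze(-1)                   # (seq_len, 1), broadcasts over the embedding dim
    return p * e_mask + (1.0 - p) * e_input   # soft-masked embeddings, shape (seq_len, emb_dim)

def correction_logits(h_final: torch.Tensor, e_input: torch.Tensor, w_out: torch.Tensor) -> torch.Tensor:
    """Residual connection to the input embeddings, then a linear map to character logits.

    h_final: final-layer states of the correction BERT, shape (seq_len, emb_dim)
    w_out:   output projection over candidate characters, shape (vocab_size, emb_dim)
    A softmax over the last dimension turns the logits into correction probabilities.
    """
    return (h_final + e_input) @ w_out.T      # shape (seq_len, vocab_size)
```

A character whose error probability is close to 1 is thus represented almost entirely by the [MASK] embedding, while a character judged correct largely keeps its own input embedding.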
embedding of the character, as in BERT. The output is a sequence of labels G = (g_1, g_2, ..., g_n), where g_i denotes the label of the i-th character; 1 means the character is incorrect and 0 means it is correct. For each character there is a probability p_i representing the likelihood of the label being 1. The higher p_i is, the more likely the character is incorrect.

In this work, we realize the detection network as a bidirectional GRU (Bi-GRU). For each character of the sequence, the probability of error p_i is defined as

$$p_i = P_d(g_i = 1 \mid X) = \sigma(W_d h_i^d + b_d) \tag{1}$$

where P_d(g_i = 1 | X) denotes the conditional probability given by the detection network, σ denotes the sigmoid function, h_i^d denotes the hidden state of the Bi-GRU, and W_d and b_d are parameters. Furthermore, the hidden state is defined as

$$\overrightarrow{h}^d_i = \mathrm{GRU}(\overrightarrow{h}^d_{i-1}, e_i) \tag{2}$$
$$\overleftarrow{h}^d_i = \mathrm{GRU}(\overleftarrow{h}^d_{i+1}, e_i) \tag{3}$$
$$h^d_i = [\overrightarrow{h}^d_i ; \overleftarrow{h}^d_i] \tag{4}$$
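Read operationally, equations (1)-(4) amount to running a bidirectional GRU over the input embeddings and mapping each hidden state to an error probability. The following is a minimal sketch under assumed names and an arbitrary hidden size, not the authors' released code.

```python
# Minimal sketch of the detection network as a Bi-GRU (equations (1)-(4) above).
# Class name and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn

class DetectionNetwork(nn.Module):
    def __init__(self, emb_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        # Bidirectional GRU: its output at position i is [h_i_forward ; h_i_backward]  (eqs. 2-4)
        self.bigru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Linear layer (W_d, b_d) followed by a sigmoid gives the error probability p_i  (eq. 1)
        self.linear = nn.Linear(2 * hidden_dim, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, emb_dim) -> error probabilities: (batch, seq_len)
        h, _ = self.bigru(embeddings)                  # (batch, seq_len, 2 * hidden_dim)
        p = torch.sigmoid(self.linear(h)).squeeze(-1)  # each p_i lies in [0, 1]
        return p

# Usage: p = DetectionNetwork()(embeddings)  # embeddings: (batch, seq_len, 768)
```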
Test Set     Method                  Detection                      Correction
                                     Acc.   Prec.  Rec.   F1.       Acc.   Prec.  Rec.   F1.
SIGHAN       NTOU (2015)             42.2   42.2   41.8   42.0      39.0   38.1   35.2   36.6
             NCTU-NTUT (2015)        60.1   71.7   33.6   45.7      56.4   66.3   26.1   37.5
             HanSpeller++ (2015)     70.1   80.3   53.3   64.0      69.2   79.7   51.5   62.5
             Hybrid (2018b)          -      56.6   69.4   62.3      -      -      -      57.1
             FASPell (2019)          74.2   67.6   60.0   63.5      73.7   66.6   59.1   62.6
             Confusionset (2019)     -      66.8   73.1   69.8      -      71.5   59.5   64.9
             BERT-Pretrain           6.8    3.6    7.0    4.7       5.2    2.0    3.8    2.6
             BERT-Finetune           80.0   73.0   70.8   71.9      76.6   65.9   64.0   64.9
             Soft-Masked BERT        80.9   73.7   73.2   73.5      77.4   66.7   66.2   66.4
News Title   BERT-Pretrain           7.1    1.3    3.6    1.9       0.6    0.6    1.6    0.8
             BERT-Finetune           80.0   65.0   61.5   63.2      76.8   55.3   52.3   53.8
             Soft-Masked BERT        80.8   65.5   64.0   64.8      77.6   55.8   54.5   55.2
Table 3: Results of Soft-Masked BERT and BERT-Finetune learned with different sizes of training data (News Title)

Train Set    Method                  Detection                      Correction
                                     Acc.   Prec.  Rec.   F1.       Acc.   Prec.  Rec.   F1.
500,000      BERT-Finetune           71.8   49.6   48.2   48.9      67.4   36.5   35.5   36.0
             Soft-Masked BERT        72.3   50.3   49.6   50.0      68.2   37.9   37.4   37.6
1,000,000    BERT-Finetune           74.2   54.7   51.3   52.9      70.0   41.6   39.0   40.3
             Soft-Masked BERT        75.3   56.3   54.2   55.2      71.1   43.6   41.9   42.7
2,000,000    BERT-Finetune           77.0   59.7   57.0   58.3      73.1   48.0   45.8   46.9
             Soft-Masked BERT        77.6   60.0   58.5   59.2      73.7   48.4   47.3   47.8
5,000,000    BERT-Finetune           80.0   65.0   61.5   63.2      76.8   55.3   52.3   53.8
             Soft-Masked BERT        80.8   65.5   64.0   64.8      77.6   55.8   54.5   55.2
BERT-Pretrain performs fairly poorly. The results indicate that BERT without fine-tuning (i.e., BERT-Pretrain) would not work, whereas BERT with fine-tuning (i.e., BERT-Finetune, etc.) can boost the performance remarkably. Here we see another successful application of BERT, which can acquire a certain amount of knowledge for language understanding. Furthermore, Soft-Masked BERT beats BERT-Finetune by large margins on both datasets. The results suggest that error detection is important for the utilization of BERT in CSC and that soft masking is indeed an effective means.
3.5 Effect of Hyper Parameter

We present the results of Soft-Masked BERT on (the test data of) News Title to illustrate the effect of the hyper-parameter and the size of training data.

Table 3 shows the results of Soft-Masked BERT as well as BERT-Finetune learned with different sizes of training data. One can find that the best result is obtained for Soft-Masked BERT when the training-data size is 5 million, indicating that the more training data is utilized, the higher the performance that can be achieved. One can also observe that Soft-Masked BERT is consistently superior to BERT-Finetune.

Table 5 presents the results of Soft-Masked BERT with different values of the hyper-parameter λ. A larger λ value means a higher weight on error correction. Error detection is an easier task than error correction, because the former is essentially a binary classification problem while the latter is a multi-class classification problem. The highest F1 score is obtained when λ is 0.8, which means that a good compromise between detection and correction is reached.
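To spell out the role of λ, the sketch below shows one way such a λ-weighted objective can be written, assuming, as the discussion implies, that the overall loss linearly combines a correction loss and a detection loss; the function and argument names are ours, and the paper's exact objective is not reproduced in this excerpt.

```python
# Illustrative lambda-weighted training objective (a sketch, not the authors' code).
import torch
import torch.nn.functional as F

def combined_loss(det_probs, det_labels, corr_logits, corr_labels, lam: float = 0.8):
    # Detection loss: binary cross-entropy on per-character error probabilities.
    loss_d = F.binary_cross_entropy(det_probs, det_labels.float())
    # Correction loss: cross-entropy over candidate characters at each position.
    loss_c = F.cross_entropy(corr_logits.view(-1, corr_logits.size(-1)), corr_labels.view(-1))
    # A larger lam puts more weight on correction, a smaller lam on detection.
    return lam * loss_c + (1.0 - lam) * loss_d
```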
3.6 Ablation Study

We carried out an ablation study of Soft-Masked BERT on both datasets. Table 4 shows the results on News Title. (We omit the results on SIGHAN, which show similar trends, due to space limitations.)
Table 4: Ablation Study of Soft-Masked BERT on News Title
Method                                Detection                      Correction
                                      Acc.   Prec.  Rec.   F1.       Acc.   Prec.  Rec.   F1.
BERT-Finetune + Force (Upper Bound)   89.9   75.6   90.3   82.3      82.9   58.4   69.8   63.6
Soft-Masked BERT                      80.8   65.5   64.0   64.8      77.6   55.8   54.5   55.2
Soft-Masked BERT-R                    81.0   75.2   53.9   62.8      78.4   64.6   46.3   53.9
Rand-Masked BERT                      70.9   46.6   48.5   47.5      68.1   38.8   40.3   39.5
BERT-Finetune                         80.0   65.0   61.5   63.2      76.8   55.3   52.3   53.8
Hard-Masked BERT (0.95)               80.6   65.3   63.2   64.2      76.7   53.6   51.8   52.7
Hard-Masked BERT (0.9)                77.4   57.8   60.3   59.0      72.4   44.0   45.8   44.9
Hard-Masked BERT (0.7)                65.3   38.0   50.9   43.5      58.9   24.2   32.5   27.7
Table 5: Impact of Different Values of λ

λ      Detection                      Correction
       Acc.   Prec.  Rec.   F1.       Acc.   Prec.  Rec.   F1.
0.8    72.3   50.3   49.6   50.0      68.2   37.9   37.4   37.6
0.5    72.3   50.0   49.3   49.7      68.0   37.5   37.0   37.3
0.2    71.5   48.6   50.4   49.5      66.9   35.7   37.1   36.4
In Soft-Masked BERT-R, the residual connection in the model is removed. In Hard-Masked BERT, if the error probability given by the detection network exceeds a threshold (0.95, 0.9, or 0.7), the embedding of the current character is set to the embedding of the [MASK] token; otherwise the embedding remains unchanged. In Rand-Masked BERT, the error probability is randomized with a value between 0 and 1. We can see that all the major components of Soft-Masked BERT are necessary for achieving high performance. We also tried 'BERT-Finetune + Force', whose performance can be viewed as an upper bound. In this method, we let BERT-Finetune make predictions only at the positions where there are errors and select a character from the rest of the candidate list. The result indicates that there is still large room for Soft-Masked BERT to make improvement.
3.7 Discussions

We observed that Soft-Masked BERT is able to make more effective use of global context information than BERT-Finetune. With soft masking, the likely errors are identified, and as a result the model can better leverage the power of BERT to make sensible reasoning for error correction by referring not only to local context but also to global context. For example, there is a typo in the sentence '我会说一点儿，不过一个汉子也看不懂，所以我迷路了' (I can speak a little Chinese, but I don't understand man, so I got lost.). The word '汉子' (man) is incorrect and should be written as '汉字' (Chinese character). BERT-Finetune cannot rectify the mistake, but Soft-Masked BERT can, because the error correction can only be accurately conducted with global context information.

We also found that there are two major types of errors in almost all methods, including Soft-Masked BERT, which affect the performance. For statistics on the errors, we sampled 100 errors from the test set. We found that 67% of the errors require strong reasoning ability, 11% of the errors are due to lack of world knowledge, and the remaining 22% of the errors have no significant type.

The first type of error is due to lack of inference ability. Accurate correction of such typos requires stronger inference ability. For example, for the sentence '他主动拉了姑娘的手，心里很高心，嘴上故作生气' (He intentionally took the girl's hand, and was very x, but was pretending to be angry.), where the incorrect word 'x' is not comprehensible, there might be two possible corrections: changing '高心' to '寒心' (chilled) or changing '高心' to '高兴' (happy), of which the latter is more reasonable to humans. One can see that in order to make more reliable corrections, the models must have stronger inference ability.

The second type of error is due to lack of world knowledge. For example, in the sentence '芜湖：女子落入青戈江，众人齐救援' (Wuhu: a woman fell into the Qingge River, and people tried hard to rescue her.), '青戈江' (Qingge River) is a typo of '青弋江' (Qingyi River). Humans can discover the typo because the river in the city of Wuhu, China is called Qingyi, not Qingge. It is still very challenging for existing models and general AI systems to detect and correct such kinds of errors.
4 Related Work

Various studies have been conducted on spelling error correction so far, which plays an important role in many applications, including search (Gao et al., 2010), optical character recognition (OCR) (Afli et al., 2016), and essay scoring (Burstein and Chodorow, 1999).

Chinese spelling error correction (CSC) is a special case, but it is more challenging due to its conflation with Chinese word segmentation; it has received a considerable number of investigations (Yu et al., 2014; Yu and Li, 2014; Tseng et al., 2015; Wang et al., 2019). Early work in CSC followed the pipeline of error detection, candidate generation, and final candidate selection. Some researchers employed unsupervised methods using language models and rules (Yu and Li, 2014; Tseng et al., 2015), while others viewed it as a sequential labeling problem and employed conditional random fields or hidden Markov models (Tseng et al., 2015; Zhang et al., 2015). Recently, deep learning has been applied to spelling error correction (Guo et al., 2019; Wang et al., 2019); for example, a Seq2Seq model with BERT as encoder was employed (Hong et al., 2019), which transforms the input sentence into a new sentence with the spelling errors corrected.

BERT (Devlin et al., 2018) is a language representation model with a Transformer encoder as its architecture. BERT is first pre-trained on a very large corpus in a self-supervised fashion (masked language modeling and next sentence prediction). Then, it is fine-tuned using a small amount of labeled data in a downstream task. Since its inception, BERT has demonstrated superior performance in almost all language understanding tasks, such as those in the GLUE challenge (Wang et al., 2018a). BERT has shown strong ability to acquire and utilize knowledge for language understanding. Recently, other language representation models have also been proposed, such as XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2019). In this work, we extend BERT to Soft-Masked BERT for spelling error correction and, as far as we know, no similar architecture has been proposed before.

5 Conclusion

In this paper, we have proposed a novel neural network architecture for spelling error correction, more specifically Chinese spelling error correction (CSC). Our model, called Soft-Masked BERT, is composed of a detection network and a correction network based on BERT. The detection network identifies likely incorrect characters in the given sentence and soft-masks the characters. The correction network takes the soft-masked characters as input and makes corrections on the characters. The technique of soft masking is general and potentially useful in other detection-correction tasks. Experimental results on two datasets show that Soft-Masked BERT significantly outperforms the state-of-the-art method of solely utilizing BERT. As future work, we plan to extend Soft-Masked BERT to other problems like grammatical error correction and to explore other possibilities of implementing the detection network.

References

Haithem Afli, Zhengwei Qiu, Andy Way, and Páraic Sheridan. 2016. Using SMT for OCR error correction of historical texts. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 962–966.

Jill Burstein and Martin Chodorow. 1999. Automated essay scoring for nonnative English speakers. In Proceedings of a Symposium on Computer Mediated Language Assessment and Evaluation in Natural Language Processing, pages 68–75. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jianfeng Gao, Xiaolong Li, Daniel Micol, Chris Quirk, and Xu Sun. 2010. A large scale ranker-based system for search query spelling correction. In COLING 2010, 23rd International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2010, Beijing, China, pages 358–366.

Jinxi Guo, Tara N. Sainath, and Ron J. Weiss. 2019. A spelling correction model for end-to-end speech recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5651–5655. IEEE.

Yuzhong Hong, Xianguo Yu, Neng He, Nan Liu, and Junhui Liu. 2019. FASPell: A fast, adaptable, simple, powerful Chinese spell checker based on DAE-decoder paradigm. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 160–169.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Bruno Martins and Mário J. Silva. 2004. Spelling correction for search engine queries. In Advances in Natural Language Processing, 4th International Conference, EsTAL 2004, Alicante, Spain, October 20-22, 2004, Proceedings, pages 372–383.

Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang, and Hsin-Hsi Chen. 2015. Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pages 32–37.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018b. A hybrid approach to automatic corpus generation for Chinese spelling check. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2517–2527.

Dingmin Wang, Yi Tay, and Li Zhong. 2019. Confusionset-guided pointer networks for Chinese spelling check. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5780–5785.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Junjie Yu and Zhenghua Li. 2014. Chinese spelling error detection and correction based on language model, pronunciation, and shape. In Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 220–223, Wuhan, China. Association for Computational Linguistics.

Liang-Chih Yu, Lung-Hao Lee, Yuen-Hsien Tseng, and Hsin-Hsi Chen. 2014. Overview of SIGHAN 2014 bake-off for Chinese spelling check. In Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 126–132.

Shuiyuan Zhang, Jinhua Xiong, Jianpeng Hou, Qiao Zhang, and Xueqi Cheng. 2015. HanSpeller++: A unified framework for Chinese spelling correction. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pages 38–45.