work based on BERT. The detection network predicts the probabilities of errors and the correction network predicts the probabilities of error corrections, while the former passes its prediction results to the latter using soft masking.

More specifically, our method first creates an embedding for each character in the input sentence, referred to as the input embedding. Next, it takes the sequence of embeddings as input and outputs the probabilities of errors for the sequence of characters (embeddings) using the detection network. After that, it calculates the weighted sum of the input embeddings and [MASK] embeddings, weighted by the error probabilities. The calculated embeddings mask the likely errors in the sequence in a soft way. Then, our method takes the sequence of soft-masked embeddings as input and outputs the probabilities of error corrections using the correction network, which is a BERT model whose final layer consists of a softmax function for all characters. There is also a residual connection between the input embeddings and the embeddings at the final layer. Next, we describe the details of the model.
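To make the soft-masking operation and the correction head concrete, here is a minimal sketch in PyTorch. It only illustrates the computation described above and is not the authors' implementation; the names soft_mask, correction_logits, e_input, p_err, e_mask, and w_out are ours.

```python
# Illustrative sketch of soft masking and the correction head (not the authors' code).
# e_input: input embeddings, shape (seq_len, emb_dim)
# p_err:   per-character error probabilities from the detection network, shape (seq_len,)
# e_mask:  embedding of the [MASK] token, shape (emb_dim,)
import torch

def soft_mask(e_input: torch.Tensor, p_err: torch.Tensor, e_mask: torch.Tensor) -> torch.Tensor:
    """Blend each input embedding with the [MASK] embedding, weighted by its error probability."""
    p = p_err.unsqueeze(-1)                   # (seq_len, 1), broadcasts over the embedding dim
    return p * e_mask + (1.0 - p) * e_input   # soft-masked embeddings, shape (seq_len, emb_dim)

def correction_logits(h_final: torch.Tensor, e_input: torch.Tensor, w_out: torch.Tensor) -> torch.Tensor:
    """Residual connection to the input embeddings, then a linear map to character logits.

    h_final: final-layer states of the correction BERT, shape (seq_len, emb_dim)
    w_out:   output projection over candidate characters, shape (vocab_size, emb_dim)
    A softmax over the last dimension turns the logits into correction probabilities.
    """
    return (h_final + e_input) @ w_out.T      # shape (seq_len, vocab_size)
```

A character whose error probability is close to 1 is thus represented almost entirely by the [MASK] embedding, while a character judged correct largely keeps its own input embedding.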
embedding of the character, as in BERT. The output is a sequence of labels G = (g_1, g_2, ..., g_n), where g_i denotes the label of the i-th character; 1 means the character is incorrect and 0 means it is correct. For each character there is a probability p_i representing the likelihood of the label being 1. The higher p_i is, the more likely the character is incorrect.

In this work, we realize the detection network as a bidirectional GRU (Bi-GRU). For each character of the sequence, the probability of error p_i is defined as

$$p_i = P_d(g_i = 1 \mid X) = \sigma(W_d h_i^d + b_d) \tag{1}$$

where P_d(g_i = 1 | X) denotes the conditional probability given by the detection network, σ denotes the sigmoid function, h_i^d denotes the hidden state of the Bi-GRU, and W_d and b_d are parameters. Furthermore, the hidden state is defined as

$$\overrightarrow{h}^d_i = \mathrm{GRU}(\overrightarrow{h}^d_{i-1}, e_i) \tag{2}$$
$$\overleftarrow{h}^d_i = \mathrm{GRU}(\overleftarrow{h}^d_{i+1}, e_i) \tag{3}$$
$$h^d_i = [\overrightarrow{h}^d_i ; \overleftarrow{h}^d_i] \tag{4}$$
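Read operationally, equations (1)-(4) amount to running a bidirectional GRU over the input embeddings and mapping each hidden state to an error probability. The following is a minimal sketch under assumed names and an arbitrary hidden size, not the authors' released code.

```python
# Minimal sketch of the detection network as a Bi-GRU (equations (1)-(4) above).
# Class name and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn

class DetectionNetwork(nn.Module):
    def __init__(self, emb_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        # Bidirectional GRU: its output at position i is [h_i_forward ; h_i_backward]  (eqs. 2-4)
        self.bigru = nn.GRU(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Linear layer (W_d, b_d) followed by a sigmoid gives the error probability p_i  (eq. 1)
        self.linear = nn.Linear(2 * hidden_dim, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, emb_dim) -> error probabilities: (batch, seq_len)
        h, _ = self.bigru(embeddings)                  # (batch, seq_len, 2 * hidden_dim)
        p = torch.sigmoid(self.linear(h)).squeeze(-1)  # each p_i lies in [0, 1]
        return p

# Usage: p = DetectionNetwork()(embeddings)  # embeddings: (batch, seq_len, 768)
```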
Test Set     Method                  Detection                      Correction
                                     Acc.   Prec.  Rec.   F1.       Acc.   Prec.  Rec.   F1.
SIGHAN       NTOU (2015)             42.2   42.2   41.8   42.0      39.0   38.1   35.2   36.6
             NCTU-NTUT (2015)        60.1   71.7   33.6   45.7      56.4   66.3   26.1   37.5
             HanSpeller++ (2015)     70.1   80.3   53.3   64.0      69.2   79.7   51.5   62.5
             Hybrid (2018b)          -      56.6   69.4   62.3      -      -      -      57.1
             FASPell (2019)          74.2   67.6   60.0   63.5      73.7   66.6   59.1   62.6
             Confusionset (2019)     -      66.8   73.1   69.8      -      71.5   59.5   64.9
             BERT-Pretrain           6.8    3.6    7.0    4.7       5.2    2.0    3.8    2.6
             BERT-Finetune           80.0   73.0   70.8   71.9      76.6   65.9   64.0   64.9
             Soft-Masked BERT        80.9   73.7   73.2   73.5      77.4   66.7   66.2   66.4
News Title   BERT-Pretrain           7.1    1.3    3.6    1.9       0.6    0.6    1.6    0.8
             BERT-Finetune           80.0   65.0   61.5   63.2      76.8   55.3   52.3   53.8
             Soft-Masked BERT        80.8   65.5   64.0   64.8      77.6   55.8   54.5   55.2
Table 3: Results of Soft-Masked BERT and BERT-Finetune learned with different sizes of training data (News Title)

Train Set    Method                  Detection                      Correction
                                     Acc.   Prec.  Rec.   F1.       Acc.   Prec.  Rec.   F1.
500,000      BERT-Finetune           71.8   49.6   48.2   48.9      67.4   36.5   35.5   36.0
             Soft-Masked BERT        72.3   50.3   49.6   50.0      68.2   37.9   37.4   37.6
1,000,000    BERT-Finetune           74.2   54.7   51.3   52.9      70.0   41.6   39.0   40.3
             Soft-Masked BERT        75.3   56.3   54.2   55.2      71.1   43.6   41.9   42.7
2,000,000    BERT-Finetune           77.0   59.7   57.0   58.3      73.1   48.0   45.8   46.9
             Soft-Masked BERT        77.6   60.0   58.5   59.2      73.7   48.4   47.3   47.8
5,000,000    BERT-Finetune           80.0   65.0   61.5   63.2      76.8   55.3   52.3   53.8
             Soft-Masked BERT        80.8   65.5   64.0   64.8      77.6   55.8   54.5   55.2
BERT-Pretrain performs fairly poorly. The results indicate that BERT without fine-tuning (i.e., BERT-Pretrain) would not work, whereas BERT with fine-tuning (i.e., BERT-Finetune, etc.) can boost the performance remarkably. Here we see another successful application of BERT, which can acquire a certain amount of knowledge for language understanding. Furthermore, Soft-Masked BERT beats BERT-Finetune by large margins on both datasets. The results suggest that error detection is important for the utilization of BERT in CSC and that soft masking is indeed an effective means.
3.5 Effect of Hyper Parameter

We present the results of Soft-Masked BERT on (the test data of) News Title to illustrate the effect of the hyper-parameter and the size of training data.

Table 3 shows the results of Soft-Masked BERT as well as BERT-Finetune learned with different sizes of training data. One can find that the best result is obtained for Soft-Masked BERT when the training-data size is 5 million, indicating that the more training data is utilized, the higher the performance that can be achieved. One can also observe that Soft-Masked BERT is consistently superior to BERT-Finetune.

Table 5 presents the results of Soft-Masked BERT with different values of the hyper-parameter λ. A larger λ value means a higher weight on error correction. Error detection is an easier task than error correction, because the former is essentially a binary classification problem while the latter is a multi-class classification problem. The highest F1 score is obtained when λ is 0.8, which means that a good compromise between detection and correction is reached.
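To spell out the role of λ, the sketch below shows one way such a λ-weighted objective can be written, assuming, as the discussion implies, that the overall loss linearly combines a correction loss and a detection loss; the function and argument names are ours, and the paper's exact objective is not reproduced in this excerpt.

```python
# Illustrative lambda-weighted training objective (a sketch, not the authors' code).
import torch
import torch.nn.functional as F

def combined_loss(det_probs, det_labels, corr_logits, corr_labels, lam: float = 0.8):
    # Detection loss: binary cross-entropy on per-character error probabilities.
    loss_d = F.binary_cross_entropy(det_probs, det_labels.float())
    # Correction loss: cross-entropy over candidate characters at each position.
    loss_c = F.cross_entropy(corr_logits.view(-1, corr_logits.size(-1)), corr_labels.view(-1))
    # A larger lam puts more weight on correction, a smaller lam on detection.
    return lam * loss_c + (1.0 - lam) * loss_d
```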
3.6 Ablation Study

We carried out an ablation study of Soft-Masked BERT on both datasets. Table 4 shows the results on News Title. (We omit the results on SIGHAN, which show similar trends, due to space limitations.)
Table 4: Ablation Study of Soft-Masked BERT on News Title
Method                                Detection                      Correction
                                      Acc.   Prec.  Rec.   F1.       Acc.   Prec.  Rec.   F1.
BERT-Finetune + Force (Upper Bound)   89.9   75.6   90.3   82.3      82.9   58.4   69.8   63.6
Soft-Masked BERT                      80.8   65.5   64.0   64.8      77.6   55.8   54.5   55.2
Soft-Masked BERT-R                    81.0   75.2   53.9   62.8      78.4   64.6   46.3   53.9
Rand-Masked BERT                      70.9   46.6   48.5   47.5      68.1   38.8   40.3   39.5
BERT-Finetune                         80.0   65.0   61.5   63.2      76.8   55.3   52.3   53.8
Hard-Masked BERT (0.95)               80.6   65.3   63.2   64.2      76.7   53.6   51.8   52.7
Hard-Masked BERT (0.9)                77.4   57.8   60.3   59.0      72.4   44.0   45.8   44.9
Hard-Masked BERT (0.7)                65.3   38.0   50.9   43.5      58.9   24.2   32.5   27.7
Table 5: Impact of Different Values of λ

λ      Detection                      Correction
       Acc.   Prec.  Rec.   F1.       Acc.   Prec.  Rec.   F1.
0.8    72.3   50.3   49.6   50.0      68.2   37.9   37.4   37.6
0.5    72.3   50.0   49.3   49.7      68.0   37.5   37.0   37.3
0.2    71.5   48.6   50.4   49.5      66.9   35.7   37.1   36.4
In Soft-Masked BERT-R, the residual connection in the model is removed. In Hard-Masked BERT, if the error probability given by the detection network exceeds a threshold (0.95, 0.9, or 0.7), the embedding of the current character is set to the embedding of the [MASK] token; otherwise the embedding remains unchanged. In Rand-Masked BERT, the error probability is randomized with a value between 0 and 1. We can see that all the major components of Soft-Masked BERT are necessary for achieving high performance. We also tried 'BERT-Finetune + Force', whose performance can be viewed as an upper bound. In this method, we let BERT-Finetune make predictions only at the positions where there are errors and select a character from the rest of the candidate list. The result indicates that there is still large room for Soft-Masked BERT to make improvement.
3.7 Discussions

We observed that Soft-Masked BERT is able to make more effective use of global context information than BERT-Finetune. With soft masking, the likely errors are identified, and as a result the model can better leverage the power of BERT to make sensible reasoning for error correction by referring not only to local context but also to global context. For example, there is a typo in the sentence '我会说一点儿，不过一个汉子也看不懂，所以我迷路了' (I can speak a little Chinese, but I don't understand man, so I got lost.). The word '汉子' (man) is incorrect and should be written as '汉字' (Chinese character). BERT-Finetune cannot rectify the mistake, but Soft-Masked BERT can, because the error correction can only be accurately conducted with global context information.

We also found that there are two major types of errors in almost all methods, including Soft-Masked BERT, which affect the performance. For statistics on the errors, we sampled 100 errors from the test set. We found that 67% of the errors require strong reasoning ability, 11% of the errors are due to lack of world knowledge, and the remaining 22% of the errors have no significant type.

The first type of error is due to lack of inference ability. Accurate correction of such typos requires stronger inference ability. For example, for the sentence '他主动拉了姑娘的手，心里很高心，嘴上故作生气' (He intentionally took the girl's hand, and was very x, but was pretending to be angry.), where the incorrect word 'x' is not comprehensible, there might be two possible corrections: changing '高心' to '寒心' (chilled) or changing '高心' to '高兴' (happy), of which the latter is more reasonable to humans. One can see that in order to make more reliable corrections, the models must have stronger inference ability.

The second type of error is due to lack of world knowledge. For example, in the sentence '芜湖：女子落入青戈江，众人齐救援' (Wuhu: a woman fell into the Qingge River, and people tried hard to rescue her.), '青戈江' (Qingge River) is a typo of '青弋江' (Qingyi River). Humans can discover the typo because the river in the city of Wuhu, China is called Qingyi, not Qingge. It is still very challenging for existing models and general AI systems to detect and correct such kinds of errors.
4 Related Work

Various studies have been conducted on spelling error correction so far, which plays an important role in many applications, including search (Gao et al., 2010), optical character recognition (OCR) (Afli et al., 2016), and essay scoring (Burstein and Chodorow, 1999).

Chinese spelling error correction (CSC) is a special case, but it is more challenging due to its conflation with Chinese word segmentation; it has received a considerable number of investigations (Yu et al., 2014; Yu and Li, 2014; Tseng et al., 2015; Wang et al., 2019). Early work in CSC followed the pipeline of error detection, candidate generation, and final candidate selection. Some researchers employed unsupervised methods using language models and rules (Yu and Li, 2014; Tseng et al., 2015), while others viewed it as a sequential labeling problem and employed conditional random fields or hidden Markov models (Tseng et al., 2015; Zhang et al., 2015). Recently, deep learning has been applied to spelling error correction (Guo et al., 2019; Wang et al., 2019); for example, a Seq2Seq model with BERT as encoder was employed (Hong et al., 2019), which transforms the input sentence into a new sentence with the spelling errors corrected.

BERT (Devlin et al., 2018) is a language representation model with a Transformer encoder as its architecture. BERT is first pre-trained on a very large corpus in a self-supervised fashion (masked language modeling and next sentence prediction). Then, it is fine-tuned using a small amount of labeled data in a downstream task. Since its inception, BERT has demonstrated superior performance in almost all language understanding tasks, such as those in the GLUE challenge (Wang et al., 2018a). BERT has shown strong ability to acquire and utilize knowledge for language understanding. Recently, other language representation models have also been proposed, such as XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019), and ALBERT (Lan et al., 2019). In this work, we extend BERT to Soft-Masked BERT for spelling error correction and, as far as we know, no similar architecture has been proposed before.

5 Conclusion

In this paper, we have proposed a novel neural network architecture for spelling error correction, more specifically Chinese spelling error correction (CSC). Our model, called Soft-Masked BERT, is composed of a detection network and a correction network based on BERT. The detection network identifies likely incorrect characters in the given sentence and soft-masks the characters. The correction network takes the soft-masked characters as input and makes corrections on the characters. The technique of soft masking is general and potentially useful in other detection-correction tasks. Experimental results on two datasets show that Soft-Masked BERT significantly outperforms the state-of-the-art method of solely utilizing BERT. As future work, we plan to extend Soft-Masked BERT to other problems like grammatical error correction and to explore other possibilities of implementing the detection network.

References

Haithem Afli, Zhengwei Qiu, Andy Way, and Páraic Sheridan. 2016. Using SMT for OCR error correction of historical texts. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 962–966.

Jill Burstein and Martin Chodorow. 1999. Automated essay scoring for nonnative English speakers. In Proceedings of a Symposium on Computer Mediated Language Assessment and Evaluation in Natural Language Processing, pages 68–75. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Jianfeng Gao, Xiaolong Li, Daniel Micol, Chris Quirk, and Xu Sun. 2010. A large scale ranker-based system for search query spelling correction. In COLING 2010, 23rd International Conference on Computational Linguistics, Proceedings of the Conference, 23-27 August 2010, Beijing, China, pages 358–366.

Jinxi Guo, Tara N. Sainath, and Ron J. Weiss. 2019. A spelling correction model for end-to-end speech recognition. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5651–5655. IEEE.

Yuzhong Hong, Xianguo Yu, Neng He, Nan Liu, and Junhui Liu. 2019. FASPell: A fast, adaptable, simple, powerful Chinese spell checker based on DAE-decoder paradigm. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 160–169.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Bruno Martins and Mário J. Silva. 2004. Spelling correction for search engine queries. In Advances in Natural Language Processing, 4th International Conference, EsTAL 2004, Alicante, Spain, October 20-22, 2004, Proceedings, pages 372–383.

Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang, and Hsin-Hsi Chen. 2015. Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pages 32–37.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018b. A hybrid approach to automatic corpus generation for Chinese spelling check. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2517–2527.

Dingmin Wang, Yi Tay, and Li Zhong. 2019. Confusionset-guided pointer networks for Chinese spelling check. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5780–5785.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.

Junjie Yu and Zhenghua Li. 2014. Chinese spelling error detection and correction based on language model, pronunciation, and shape. In Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 220–223, Wuhan, China. Association for Computational Linguistics.

Liang-Chih Yu, Lung-Hao Lee, Yuen-Hsien Tseng, and Hsin-Hsi Chen. 2014. Overview of SIGHAN 2014 bake-off for Chinese spelling check. In Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 126–132.

Shuiyuan Zhang, Jinhua Xiong, Jianpeng Hou, Qiao Zhang, and Xueqi Cheng. 2015. HanSpeller++: A unified framework for Chinese spelling correction. In Proceedings of the Eighth SIGHAN Workshop on Chinese Language Processing, pages 38–45.