Table 2: Main Results. We reran the released code of ReaLiSe (Xu et al., 2021), along with their released models, to obtain the results. ReaLiSe was trained on the in-domain, gold-standard data of the Sighan datasets and represents a SOTA model for them. The numbers in gray represent the out-of-domain results for ReaLiSe. Detailed results of each sub-domain are provided in Appendix E.1.
correction F1 (C-F) and sentence-level false positive rate (FPR) to provide a more complete view of the model performance. Details of the evaluation metrics can be found in Appendix D.

3.3 Baselines

We compare our approach against the prompt-based method under two settings: zero-shot prompting (ZSP) and few-shot prompting (FSP). For the few-shot setting, we select 10 examples from the Pseudo-Dev dataset. The details of the prompts can be found in Appendix C.2, and the example selection strategy is described in Appendix C.3. During inference, we adopt the greedy decoding strategy.²

To provide a more comprehensive comparison, we also present results from state-of-the-art domain-general CSC models trained on 34 million pairs of synthetic CSC data for reference. These models include Finetuned BERT (Devlin et al., 2019), Softmasked BERT (Zhang et al., 2020), and ReLM (Liu et al., 2024).³

Additionally, for datasets that have in-domain manually annotated data, we report results from models specifically trained on it, serving as another reference point.

3.4 Selection of LLMs

We conduct experiments on three open-source LLMs: Baichuan2 (Yang et al., 2023a), Qwen1.5 (Bai et al., 2023), and InternLM2 (Cai et al., 2024). For the main results, we select models with parameter sizes ranging from 10B to 20B to ensure that the LLMs have sufficient zero-shot and few-shot capabilities for meaningful comparisons. Additionally, we report the ZSP and FSP results of the widely recognized best-performing LLM family, GPT, including GPT-3.5 and GPT-4.

To simplify the analysis, we select Baichuan2 7B as a representative model to investigate the impact of the components in our approach.

3.5 Hyperparameters of Our Approach

We use the "Base" version of each LLM family. The distortion probabilities of the distortion model were derived from the statistics of the Pseudo-Dev dataset. We tuned α on Baichuan2 7B using the Pseudo-Dev dataset; eventually, α was set to 2.5 for all experiments. During inference, we adopt beam search with a beam size of 8.

² We observe that, for the prompt-based baselines, the improvement of beam search is marginal and sometimes even detrimental.
³ The results of these models were obtained by running the released code along with the corresponding checkpoints provided at https://github.com/gingasan/lemon.git.
System           S-F↑  S-P↑  S-R↑  C-F↑  C-P↑  C-R↑  FPR↓
rSighan 15
ReLM             55.5  61.1  50.8  61.0  78.5  49.9   9.5
GPT3.5 ZSP       42.0  41.7  42.3  47.8  42.5  54.6  25.8
GPT3.5 FSP       41.7  42.0  41.4  48.4  44.5  53.2  23.4
GPT4 ZSP         43.5  38.1  50.8  49.9  40.2  66.0  47.5
GPT4 FSP         48.7  44.2  54.4  52.9  44.0  66.3  38.8
BC2 13B (OUR)    59.6  66.5  54.0  67.3  78.3  59.0   8.3
Q1.5 14B (OUR)   57.6  62.5  53.4  66.0  74.1  59.4  10.2
IL2 20B (OUR)    60.5  67.2  55.0  67.8  78.7  59.6   8.3
Lemon Nov (1000)
ReLM             36.4  46.7  29.8  36.0  49.2  28.3  14.3
GPT3.5 ZSP       19.1  20.8  17.7  19.8  17.9  22.3  29.2
GPT3.5 FSP       25.5  31.4  21.4  24.9  27.2  23.0  19.6
GPT4 ZSP         30.6  28.4  33.1  33.4  26.9  44.1  33.5
GPT4 FSP         42.7  41.4  44.0  43.1  38.9  48.3  27.4
BC2 13B (OUR)    45.3  53.7  39.1  49.1  57.0  43.2  13.1
Q1.5 14B (OUR)   38.2  41.7  35.3  43.7  44.5  43.0  21.8
IL2 20B (OUR)    42.8  49.9  37.5  46.4  52.8  41.4  15.3
ECSpell Odw
ReLM             66.5  67.5  65.6  73.0  86.4  63.1   7.1
GPT3.5 ZSP       58.2  62.5  54.5  61.0  62.7  59.3   4.6
GPT3.5 FSP       59.3  64.1  55.2  60.7  62.4  59.0   2.4
GPT4 ZSP         73.1  73.0  73.3  77.3  75.5  79.2   5.0
GPT4 FSP         73.2  73.5  72.9  78.5  78.3  78.7   5.0
BC2 13B (OUR)    92.0  94.4  89.7  93.8  95.6  92.1   0.4
Q1.5 14B (OUR)   87.4  88.6  86.3  91.6  91.8  91.3   2.9
IL2 20B (OUR)    91.1  92.9  89.3  93.8  95.9  91.8   0.4

Table 3: The comparison to the GPT family on the rSighan 15, Lemon Nov, and ECSpell Odw datasets. The version of GPT3.5 is 'gpt-3.5-turbo-0125' and that of GPT4 is 'gpt-4-0613'. BC2 is short for Baichuan2, Q1.5 for Qwen1.5, and IL2 for InternLM2.

System     | rSighan 15         | Lemon Nov          | ECSpell Odw
           | S-F↑  C-F↑  FPR↓   | S-F↑  C-F↑  FPR↓   | S-F↑  C-F↑  FPR↓
BC2  7B      59.8  68.2   8.0     43.2  47.7  13.6     89.7  93.0   1.3
BC2  13B     59.6  67.3   8.3     43.5  47.9  13.0     92.0  93.8   0.4
Q1.5 0.5B    56.3  63.5  10.0     33.2  40.2  22.2     84.7  89.9   3.8
Q1.5 1.8B    58.3  65.3  10.3     35.6  42.3  19.9     90.3  92.8   1.7
Q1.5 4B      58.4  66.8  10.0     35.9  42.3  21.1     88.4  91.1   3.4
Q1.5 7B      59.4  67.0   8.5     39.0  44.7  19.0     87.1  91.4   3.4
Q1.5 14B     57.6  66.0  10.2     36.4  42.6  21.2     87.4  91.6   2.9
Q1.5 32B     57.2  65.8  10.0     36.6  42.2  19.4     88.2  91.9   2.9
IL2  1.8B    55.3  64.0  12.2     33.2  40.1  22.6     88.3  91.0   2.1
IL2  7B      58.1  65.5  10.2     38.8  44.2  18.0     89.3  92.0   2.1
IL2  20B     60.5  67.8   8.3     40.5  45.3  15.1     91.1  93.8   0.4

Table 4: Ablation results of model size.

4 Main Results

We present the main results in Table 2 and the comparison to the GPT family in Table 3. Conducting a comprehensive evaluation of the GPT family is expensive, so we limit the comparison to a small-scale study, focusing on the three datasets mentioned in Section 3.1.⁴ Moreover, several qualitative examples are provided in Appendix E.2 to illustrate the performance of our approach.

After applying our approach, all three LLM families outperform their prompt-based counterparts on all five datasets by a large margin.

Compared to the recent state-of-the-art domain-general CSC models, which are trained on 34M synthetic CSC data, our approach also achieves competitive or even superior performance on most datasets, especially on the MCSCSet and ECSpell datasets. The results indicate that our approach has better generalization across different domains and genres than the current domain-general SOTAs. However, our approach still largely lags behind the domain-specific SOTAs trained on the gold-standard labeled data of each dataset.

Compared to the GPT family, our approach consistently outperforms GPT3.5 on all three datasets and achieves better performance than GPT4 in most cases. However, our approach may exhibit a lower C-R compared to GPT4, indicating that we might miss some errors that GPT4 can correct.

5 Discussion

5.1 Impact of the Size of the LLM

First, we investigate the impact of the LLM size on the performance of our approach.

As shown in Table 4, in general, larger LLMs tend to perform better than smaller ones within the same model family. However, the Qwen1.5 model family is an exception: the performance improvement becomes marginal when the model size exceeds 1.8B parameters and even decreases when the model size reaches 7B.

When comparing the performance of models of the same size across different model families, we find that the Baichuan2 family generally outperforms the other two model families.

5.2 Effectiveness of the Distortion Model

To investigate the effectiveness of the minimal distortion model, we first remove the distortion model p_DM(x | y) from the decoding process. Alternatively, we adopt a constrained text generation (CTG) approach to correct the input sentence. For each step, we limit the vocabulary to tokens that are related to the corresponding characters in the input sentence,⁵ and let the model select the most likely token from the constrained vocabulary. The results are shown in the "CTG" rows of Table 5.

⁴ The original Lemon-Nov dataset includes 6,000 sentences, which is excessively large for our scope. Therefore, we selected the first 1,000 sentences for this comparison.
⁵ Classified as Identical, Same Pinyin, Similar Pinyin, or Similar Shape.
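A minimal sketch of the CTG baseline described above, with stub callables standing in for the LLM's next-character distribution and for the related-character lookup; the actual baseline constrains the LLM's token vocabulary directly.

```python
from typing import Callable, Dict, Set

def ctg_correct(
    x: str,
    related: Callable[[str], Set[str]],                # characters related to an input character
    next_logprobs: Callable[[str], Dict[str, float]],  # stub for log p_LLM(next char | prefix)
) -> str:
    """Constrained text generation: at step i the vocabulary is limited to x[i]
    and its related characters, and the most likely character is chosen greedily.
    No distortion model is involved, which is why this baseline tends to replace
    characters with frequent look-alikes (Table 5, "CTG" rows)."""
    out = ""
    for x_char in x:
        allowed = {x_char} | related(x_char)
        logp = next_logprobs(out)
        out += max(allowed, key=lambda c: logp.get(c, float("-inf")))
    return out
```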
System    S-F↑   S-P↑   S-R↑   C-F↑   C-P↑   C-R↑   FPR↓
rSighan 15
CTG        6.7    5.3    9.1    7.7    4.2   47.7   90.0
OUR       59.8   66.0   54.7   68.2   77.8   60.6    8.0
- DT      -7.7  -12.6   -3.9   -7.1  -15.7   -0.3   +9.4
- DT*    -12.3  -18.2   -7.5   -9.8  -20.5   -1.2  +11.1
Lemon Nov
CTG        0.7    0.5    1.1    1.4    0.7   22.5   96.2
OUR       43.2   52.2   36.9   47.7   55.5   41.9   13.6
- DT     -12.3  -20.5   -6.8  -10.0  -20.8   -0.7  +13.9
- DT*    -11.7  -20.5   -5.5   -9.7  -21.8   -1.6  +14.7
ECSpell Odw
CTG       29.3   24.5   36.3   21.4   12.4   79.5   52.9
OUR       89.7   91.6   87.8   93.0   95.3   90.8    1.3
- DT      -4.0   -4.6   -3.4   -3.9   -5.8   -2.2    0.0
- DT*    -16.3  -16.9  -15.7  -12.7  -14.5  -10.9   +2.5

Table 5: Ablation results of the distortion model on Baichuan2 7B. "CTG" means constrained text generation. "- DT" means that we do not distinguish Same Pinyin, Similar Pinyin, and Similar Shape, and treat them as a single Related distortion. "- DT*" means using the confusion set from Wang et al. (2018) to identify the Related distortion.

System    S-F↑   S-P↑   S-R↑   C-F↑   C-P↑   C-R↑   FPR↓
rSighan 15
Vanilla   18.0   15.9   20.6   20.7   14.3   37.6   52.9
w/ LR    +39.4  +43.4  +35.0  +43.7  +53.3  +23.9  -38.4
w/ FR     +3.8   +6.2   +0.8   +5.4   +8.3   -6.6  -19.3
w/ Both  +41.9  +50.1  +34.1  +47.4  +63.5  +23.0  -44.8
Lemon Nov
Vanilla   19.4   18.0   20.9   23.6   17.1   38.3   38.5
w/ LR    +17.1  +19.5  +14.6  +19.0  +21.9   +8.6  -13.7
w/ FR     +9.0  +13.5   +4.7   +8.5  +13.5   -4.5  -18.8
w/ Both  +23.9  +34.2  +16.0  +24.1  +38.4   +3.6  -25.0
ECSpell Odw
Vanilla   65.3   65.3   65.3   70.4   65.4   76.2   10.1
w/ LR    +25.4  +26.9  +24.0  +22.5  +28.5  +15.6   -9.7
w/ FR     +4.7  +11.2   -0.8   +7.5  +19.7   -4.5   -6.7
w/ Both  +24.4  +26.4  +22.5  +22.6  +29.9  +14.6   -8.8

Table 6: Ablation results of Baichuan2 7B. "LR" and "FR" represent "length reward" and "faithfulness reward", respectively. "Both" means using both the length reward and the faithfulness reward.

We can see that CTG performs poorly on all datasets. This is because a Chinese character may have many similar characters. Without the distortion model, the model is prone to replacing the original character with a higher-frequency similar character, leading to a large number of errors.

Next, we investigate the impact of the distortion type by treating the three types of related but not identical distortions as a single distortion type. As shown in the "- DT" rows of Table 5, the performance drops significantly, but not as severely as when removing the distortion model. This performance drop is mainly due to a decrease in precision.

We also examine the effectiveness of our rule-based tool for identifying related distortions. We replace our rule-based tool with the confusion set from Wang et al. (2018) to identify the related distortion. The results in the "- DT*" rows of Table 5 show that the confusion set from Wang et al. (2018) is less effective than our rule-based tool, leading to more severe performance degradation.

5.3 Impact of Two Rewards

In this work, we propose two rewards to optimize the decoding process: the length reward and the faithfulness reward. The ablation study results of the two rewards are shown in Table 6.

The results show that the length reward significantly improves performance on all three datasets. This improvement can be attributed to increases in both precision and recall, indicating that the length reward is crucial to our approach. The faithfulness reward mainly contributes to improving precision, and it may slightly reduce recall. Overall, the faithfulness reward balances the trade-off between precision and recall, leading to a higher F1 score. The combination of the two rewards achieves better performance than using them separately, especially when datasets contain less formal text, more colloquial expressions, and more diverse named entities.

5.4 Does Our Approach Work Well on Simpler LMs?

Though our primary focus is on the performance of our approach on LLMs, the language model term of Equation 1 can be substituted with simpler models, such as n-gram models, masked language models, or small-scale causal language models. In this subsection, we investigate the performance of our approach using these simpler language models.

The LMs we investigate include: n-gram LM: KLM,⁶ a 5-gram language model trained on the Chinese Gigaword corpus; Masked LM: BERT,⁷ a bidirectional language model pre-trained with the mask filling task and the next sentence prediction task; Small causal LM: GPT2,⁸ a small-scale causal language model (about 102M parameters) trained on CLUECorpusSmall (about 5B characters).

⁶ shibing624/chinese-kenlm-klm
⁷ bert-base-chinese
⁸ uer/gpt2-chinese-cluecorpussmall
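As a sketch of swapping in the n-gram LM above, the snippet below scores a candidate with KenLM. The model path is a hypothetical local copy of the shibing624/chinese-kenlm-klm model, and characters are space-separated because KenLM scores whitespace-delimited tokens; this score would simply replace the LLM term of Equation 1.

```python
import kenlm

KLM_PATH = "chinese_5gram.klm"  # hypothetical local path to the KenLM model
klm = kenlm.Model(KLM_PATH)

def ngram_logprob(candidate: str) -> float:
    """log10 p_LM(y) from the 5-gram model, usable in place of the LLM term
    when reproducing the simpler-LM setting of Table 7."""
    return klm.score(" ".join(candidate), bos=True, eos=True)

print(ngram_logprob("商务部牵头"))
```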
System     | rSighan 15         | Lemon Nov          | ECSpell Odw
           | S-F↑  C-F↑  FPR↓   | S-F↑  C-F↑  FPR↓   | S-F↑  C-F↑  FPR↓
BC2 13B      59.6  67.3   8.3     43.5  47.9  13.0     92.0  93.8   0.4
Q1.5 14B     57.6  66.0  10.2     36.4  42.6  21.2     87.4  91.6   2.9
IL2 20B      60.5  67.8   8.3     40.5  45.3  15.1     91.1  93.8   0.4
KLM          29.3  38.9  33.8      5.8   9.4  65.8     58.3  65.3  23.5
BERT 110M    31.3  34.0   0.2     13.3  12.5   0.6     59.1  63.6   0.0
GPT2 102M    55.0  64.7   8.1     26.1  30.8  28.4     78.6  85.0   5.4

Table 7: Results of applying our approach to simpler LMs.

System                    S-F↑  C-F↑  FPR↓
Finetuned BERT    ORI     35.3  48.5   7.5
                  w/ k    +0.7  +1.7  +0.1
Softmasked BERT   ORI     35.3  48.5   8.1
                  w/ k    +1.2  +2.1  -0.5
ReLM              ORI     37.8  50.2   6.8
                  w/ k    +0.9  +1.9  -0.2
Baichuan2 13B     OUR     66.0  76.9   1.7
                  w/ k    +5.1  +5.4  -0.2
Qwen1.5 14B       OUR     61.1  72.6   3.1
                  w/ k    +9.1  +8.8  -1.0
InternLM2 20B     OUR     63.2  72.9   2.6
                  w/ k    +4.8  +5.4  -0.0

Table 8: The results of introducing new knowledge by adding a prefix k to the input on the MCSCSet. "ORI" denotes the original input without any prefix.

The results are shown in Table 7. From these results, we can see that our approach also works with simpler LMs. On the ECSpell-Odw dataset, our approach enables simpler language models (LMs) to achieve sentence- and character-level correction F1 scores higher than 50% and 60%, respectively. However, the performance of our approach on simpler LMs still lags significantly behind that of the large language models (LLMs), highlighting the importance of the scale of pre-training data and model size.

5.5 How to Introduce New Knowledge into Our Approach?

The LLM part of our approach offers a straightforward way to incorporate new knowledge without the need for further training: adding text that describes the new knowledge as an input prefix.

Given the new knowledge k, Equation 1 can be adjusted from p(x, y) to p(x, y | k). We then have:

\[
p(x, y \mid k) = p(x \mid y, k)\, p(y \mid k) \approx p_{\mathrm{DM}}(x \mid y)\, p_{\mathrm{LLM}}(y \mid k), \tag{8}
\]

where, by assuming that x and k are conditionally independent given y, we approximate p(x | y, k) as p_DM(x | y). The second term, p_LLM(y | k), can be calculated by the LLM using the input prefix k.

To illustrate this point, we conducted a simple experiment introducing domain and text-format information as new knowledge into our approach. We chose the MCSCSet for this experiment, as the sentences in it share a common characteristic: they are questions from patients. We can introduce this knowledge into the LLM by adding a simple input prefix k = "患者提问:" ("A patient asks:").

The results in Table 8 demonstrate that introducing new knowledge into the LLM by merely modifying the input prefix can significantly improve the model's performance on the CSC task.

We provide a real case from the MCSCSet to explain why this method works. Consider the sentence "未挨前兆" (wèi āi qián zhào, "without being near any prior warnings"), which should be corrected to "胃癌前兆" (wèi ái qián zhào, "early symptoms of stomach cancer") in the medical domain. This sentence contains only four characters, insufficient to provide enough context for accurate correction, even for humans. CSC models fail to correct this sentence or suggest incorrect corrections, such as "未提前兆" (wèi tí qián zhào, "did not provide prior warnings") or "未按前兆" (wèi àn qián zhào, "not according to the prior warnings"). However, if we add the prefix "患者提问:" ("A patient asks:"), which provides the knowledge that the sentence is a patient's question about a medical condition, the model can correctly predict "胃癌前兆".

In addition to this simple experiment, we also provide an experiment in Appendix F.6 to show that we can use the context as new knowledge to improve the performance of the CSC model in real-world applications.

5.6 More Discussions

Due to space constraints, some interesting discussions have been moved to the Appendix. These include: a discussion on how the pre-training data of the LLM affects the performance (F.1); a comparison between our approach and the SFT approach (F.2); an analysis of the influence of beam size on the performance (F.3); an exploration of whether the imperfect estimation of the distortion model impacts the performance (F.4); and a brief runtime analysis (F.5).
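A sketch of how the second term of Equation 8, p_LLM(y | k), can be computed with an off-the-shelf causal LM. The small GPT2 model from Appendix F.1 is used here only to keep the example light (the main results use Baichuan2/Qwen1.5/InternLM2 "Base" models), and the tokenization handling is an assumption rather than the paper's exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "uer/gpt2-chinese-cluecorpussmall"  # small causal LM from Appendix F.1
tok = AutoTokenizer.from_pretrained(MODEL_ID)
lm = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def logp_y_given_k(y: str, k: str = "") -> float:
    """Approximate log p_LLM(y | k) by conditioning the LM on the knowledge
    prefix k and summing the log-probabilities of y's tokens only."""
    k_ids = tok(k, add_special_tokens=False).input_ids
    y_ids = tok(y, add_special_tokens=False).input_ids
    ids = torch.tensor([k_ids + y_ids])
    logprobs = torch.log_softmax(lm(ids).logits[0, :-1], dim=-1)
    total, start = 0.0, max(len(k_ids), 1)  # the first token is unscored when k is empty
    for pos in range(start, ids.shape[1]):
        total += logprobs[pos - 1, ids[0, pos]].item()
    return total

# The Section 5.5 example: the medical prefix should make the gold correction more likely.
print(logp_y_given_k("胃癌前兆", k="患者提问:"))
print(logp_y_given_k("胃癌前兆"))
```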
6 Related Works

6.1 Chinese Spelling Check

Previous research on the CSC task can be divided into three eras, accompanied by paradigm shifts.

The Early Unsupervised Era  Early CSC approaches mainly utilized unsupervised pipeline systems (Yeh et al., 2013; Yu et al., 2014; Yu and Li, 2014; Huang et al., 2014; Xie et al., 2015). These systems typically operate in three main steps: error detection, candidate correction generation from a confusion set, and candidate ranking using a statistical n-gram language model.

The Supervised Learning Era  By 2018, the advent of techniques for automatically generating pseudo-labeled data had begun to address the challenge of data scarcity in CSC (Wang et al., 2018), marking a shift in the paradigm of CSC research towards a supervised learning era dominated by deep neural networks. This era saw researchers exploring various avenues to enhance CSC performance. Some focused on finding better model architectures (Zhang et al., 2020; Zhu et al., 2022), while others delved into more effective training strategies (Liu et al., 2022; Wu et al., 2023; Liu et al., 2024). Additionally, there was an effort to enrich models with information beyond text, such as phonetic or visual features (Cheng et al., 2020; Xu et al., 2021; Li et al., 2022; Liang et al., 2023).

Similar to our work, Wu et al. (2023) also decomposed p(x | y) into two parts to improve CSC performance. However, they achieved this by adding an auxiliary training loss. Our work stands out by using an off-the-shelf LLM as the backbone and a minimal distortion model to achieve good CSC performance without any additional training.

The Era of LLMs  Our work represents an initial foray into what can be considered the third era of CSC research: the era of LLMs. This phase explores the potential of LLMs in addressing the CSC task. As discussed in the introduction, related studies in this era fall into two main categories: prompt-based and supervised fine-tuning. Li et al. (2023a) were the first to investigate the prompt-based approach under various settings. Building on this work, Dong et al. (2024) proposed enriching prompts with additional information, such as pronunciation and character glyphs. Compared to the prompt-based approach, the SFT-based approach has been shown to be more effective (Li et al., 2023a). However, the performance of SFT-based LLMs still falls significantly behind pre-LLM methods. Li et al. (2024) argue that this underperformance is due to the mixed character-word tokenization used by LLMs for Chinese text. To address this issue, they suggest replacing mixed tokenization with character-level tokenization before training LLMs on the CSC dataset.

In contrast to these methods, our approach requires neither prompts nor additional training.

6.2 Decoding Methods of LLMs

Intervening in the decoding process is a common approach to improving LLMs' task-specific performance. There are two popular approaches in this category: contrastive decoding and constrained decoding. Contrastive decoding (Li et al., 2023b) refines the output probabilities by comparing the output probabilities of expert and amateur models (O'Brien and Lewis, 2023; Shi et al., 2023). Constrained decoding, on the other hand, uses constraints to guide the decoding process, making the output more aligned with the task-specific requirements (Wang et al., 2023; Geng et al., 2023).

Our work is closely related to the constrained decoding approaches, in that a distortion model is used to influence the LLM decoding process.

7 Conclusion

In this work, we propose a simple, training-free, and prompt-free approach to leveraging LLMs for the CSC task. Two components, a large language model and a minimal distortion model, cooperate to correct spelling errors. We alleviate the local optima problem and the over-correction issue with two simple strategies, the length reward and the faithfulness reward, respectively. Our comprehensive experiments have shown that our approach significantly improves LLM performance. Through our approach, LLMs demonstrate remarkable domain generalization capabilities, surpassing SOTA domain-general CSC models, which are trained on extensive synthetic CSC data, on most datasets.
Limitations

Feasibility  The scope of this study is limited to the task of Chinese spelling correction, which is a subset of text error correction. Most of our design choices are tailored to the characteristics of Chinese and the specific requirements of the CSC task.

However, our approach has the potential to be directly applied to some other languages. For example, in Japanese and Korean, we can also categorize errors into phonetic similarities, such as (や, ya)-(な, na) in Japanese or (후, hu)-(부, bu) in Korean, and shape similarities, like (ュ, yu)-(ェ, e) in Japanese. For languages using a phonetic writing system, like English, minor adjustments such as adding INSERT, DELETE, and REORDER operations will be sufficient to make it work.

Comparatively, handling complex text errors that involve grammar, semantics, or pragmatics is more challenging. To tackle these errors, one could design an appropriate distortion model, though it might necessitate the adoption of more intricate rules or the implementation of a model based on neural networks. In our future work, we aim to explore ways that would allow our approach to handle these complex errors.

Computational Cost  Our approach requires the use of LLMs, which introduces additional computational costs. However, many existing techniques, such as quantization (Frantar et al., 2022; Lin et al., 2024), pruning (Ma et al., 2023; Zhu et al., 2024), distillation (Hsieh et al., 2023), and more efficient framework implementations (Dao, 2023; Yang et al., 2024), can be directly applied to our method to reduce these costs.

Acknowledgements

First and foremost, we would like to express our deepest gratitude to all anonymous reviewers for their invaluable time and constructive comments on our paper. We would also like to thank Chen Gong, Tong Zhu, Shilin Zhou, and Yu Zhang for their help in polishing our paper.

This work was supported by the National Natural Science Foundation of China (Grant No. 62176173 and 62261160648), Alibaba Group through the Alibaba Innovative Research Program, and a Project Funded by the Priority Academic Program Development (PAPD) of Jiangsu Higher Education Institutions.

References

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen technical report. ArXiv preprint, abs/2309.16609.

Zuyi Bao, Chen Li, and Rui Wang. 2020. Chunk-based Chinese spelling check with global optimization. In Proceedings of EMNLP, pages 2031–2040, Online.

Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. 2017. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, O-COCOSDA 2017, Seoul, South Korea, November 1-3, 2017, pages 1–5.

Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, Fukai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, Jia Yu, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, Yu Qiao, and Dahua Lin. 2024. InternLM2 technical report. ArXiv preprint, abs/2403.17297.

Xingyi Cheng, Weidi Xu, Kunlong Chen, Shaohua Jiang, Feng Wang, Taifeng Wang, Wei Chu, and Yuan Qi. 2020. SpellGCN: Incorporating phonological and visual similarities into language models for Chinese spelling check. In Proceedings of ACL, pages 871–881, Online.
Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. ArXiv preprint, abs/2307.08691.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, Minneapolis, Minnesota.

Ming Dong, Yujing Chen, Miao Zhang, Hao Sun, and Tingting He. 2024. Rich semantic knowledge enhanced large language models for few-shot Chinese spell checking. ArXiv preprint, abs/2403.08492.

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate post-training quantization for generative pre-trained transformers. ArXiv preprint, abs/2210.17323.

Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. 2023. Grammar-constrained decoding for structured NLP tasks without finetuning. In Proceedings of EMNLP, pages 10932–10952, Singapore.

Cheng-Yu Hsieh, Chun-Liang Li, Chih-kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. 2023. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. In Findings of ACL, pages 8003–8017, Toronto, Canada.

Yong Hu, Fandong Meng, and Jie Zhou. 2024. CSCD-NS: A Chinese spelling check dataset for native speakers. In Proceedings of ACL, pages 146–159, Bangkok, Thailand.

Qiang Huang, Peijie Huang, Xinrui Zhang, Weijian Xie, Kaiduo Hong, Bingzhou Chen, and Lei Huang. 2014. Chinese spelling check system based on tri-gram model. In Proceedings of CIPS-SIGHAN, pages 173–178, Wuhan, China.

Wangjie Jiang, Zhihao Ye, Zijing Ou, Ruihui Zhao, Jianguang Zheng, Yi Liu, Bang Liu, Siheng Li, Yujiu Yang, and Yefeng Zheng. 2022. MCSCSet: A specialist-annotated dataset for medical-domain Chinese spelling correction. In Proceedings of CIKM, pages 4084–4088.

Yichong Leng, Xu Tan, Wenjie Liu, Kaitao Song, Rui Wang, Xiang-Yang Li, Tao Qin, Edward Lin, and Tie-Yan Liu. 2023. SoftCorrect: Error correction with soft detection for automatic speech recognition. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 13034–13042.

Yichong Leng, Xu Tan, Rui Wang, Linchen Zhu, Jin Xu, Wenjie Liu, Linquan Liu, Xiang-Yang Li, Tao Qin, Edward Lin, and Tie-Yan Liu. 2021b. FastCorrect 2: Fast error correction on multiple candidates for automatic speech recognition. In Proceedings of EMNLP, pages 4328–4337, Punta Cana, Dominican Republic.

Yichong Leng, Xu Tan, Linchen Zhu, Jin Xu, Renqian Luo, Linquan Liu, Tao Qin, Xiangyang Li, Edward Lin, and Tie-Yan Liu. 2021a. FastCorrect: Fast error correction with edit alignment for automatic speech recognition. In Advances in NeurIPS, pages 21708–21719.

Jiahao Li, Quan Wang, Zhendong Mao, Junbo Guo, Yanyan Yang, and Yongdong Zhang. 2022. Improving Chinese spelling check by character pronunciation prediction: The effects of adaptivity and granularity. In Proceedings of EMNLP, pages 4275–4286, Abu Dhabi, United Arab Emirates.

Kunting Li, Yong Hu, Liang He, Fandong Meng, and Jie Zhou. 2024. C-LLM: Learn to check Chinese spelling errors character by character. ArXiv preprint, abs/2406.16536.

Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. 2023b. Contrastive decoding: Open-ended text generation as optimization. In Proceedings of ACL, pages 12286–12312, Toronto, Canada.

Yinghui Li, Haojing Huang, Shirong Ma, Yong Jiang, Yangning Li, Feng Zhou, Hai-Tao Zheng, and Qingyu Zhou. 2023a. On the (in)effectiveness of large language models for Chinese text correction. ArXiv preprint, abs/2307.09007.

Zihong Liang, Xiaojun Quan, and Qifan Wang. 2023. Disentangled phonetic representation for Chinese spelling correction. In Proceedings of ACL, pages 13509–13521, Toronto, Canada.

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of MLSys.

Linfeng Liu, Hongqiu Wu, and Hai Zhao. 2024. Chinese spelling correction as rephrasing language model. In Proceedings of the AAAI, pages 18662–18670.

Shulin Liu, Shengkang Song, Tianchi Yue, Tao Yang, Huihui Cai, TingHao Yu, and Shengli Sun. 2022. CRASpell: A contextual typo robust approach to improve Chinese spelling correction. In Findings of ACL, pages 3008–3018, Dublin, Ireland.

Qi Lv, Ziqiang Cao, Lei Geng, Chunhui Ai, Xu Yan, and Guohong Fu. 2023. General and domain-adaptive Chinese spelling check with error-consistent pretraining. TALLIP, 22(5).
Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023. LLM-Pruner: On the structural pruning of large language models. In Advances in NeurIPS, volume 36, pages 21702–21720.

Sean O'Brien and Mike Lewis. 2023. Contrastive decoding improves reasoning in large language models. ArXiv preprint, abs/2309.09117.

Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen-tau Yih. 2023. Trusting your evidence: Hallucinate less with context-aware decoding. ArXiv preprint, abs/2305.14739.

Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang, and Hsin-Hsi Chen. 2015. Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In Proceedings of SIGHAN, pages 32–37, Beijing, China.

Bailin Wang, Zi Wang, Xuezhi Wang, Yuan Cao, Rif A. Saurous, and Yoon Kim. 2023. Grammar prompting for domain-specific language generation with large language models. ArXiv preprint, abs/2305.19234.

Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018. A hybrid approach to automatic corpus generation for Chinese spelling check. In Proceedings of EMNLP, pages 2517–2527, Brussels, Belgium.

Hongqiu Wu, Shaohua Zhang, Yuchen Zhang, and Hai Zhao. 2023. Rethinking masked language modeling for Chinese spelling correction. In Proceedings of ACL, pages 10743–10756, Toronto, Canada.

Shih-Hung Wu, Chao-Lin Liu, and Lung-Hao Lee. 2013. Chinese spelling check evaluation at SIGHAN bake-off 2013. In Proceedings of SIGHAN, pages 35–42, Nagoya, Japan.

Weijian Xie, Peijie Huang, Xinrui Zhang, Kaiduo Hong, Qiang Huang, Bingzhou Chen, and Lei Huang. 2015. Chinese spelling check system based on n-gram model. In Proceedings of SIGHAN, pages 128–136, Beijing, China.

Heng-Da Xu, Zhongli Li, Qingyu Zhou, Chao Li, Zizhen Wang, Yunbo Cao, Heyan Huang, and Xian-Ling Mao. 2021. Read, listen, and see: Leveraging multimodal information helps Chinese spell checking. In Proceedings of ACL-IJCNLP, pages 716–728, Online.

Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, and Zhiying Wu. 2023a. Baichuan 2: Open large-scale language models. ArXiv preprint, abs/2309.10305.

Liner Yang, Xin Liu, Tianxin Liao, Zhenghao Liu, Mengyan Wang, Xuezhi Fang, and Erhong Yang. 2023b. Is Chinese spelling check ready? Understanding the correction behavior in real-world scenarios. AI Open, 4:183–192.

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. 2024. Parallelizing linear transformers with the delta rule over sequence length. ArXiv preprint, abs/2406.06484.

Jui-Feng Yeh, Sheng-Feng Li, Mei-Rong Wu, Wen-Yi Chen, and Mao-Chuan Su. 2013. Chinese word spelling correction based on n-gram ranked inverted index list. In Proceedings of SIGHAN, pages 43–48, Nagoya, Japan.

Junjie Yu and Zhenghua Li. 2014. Chinese spelling error detection and correction based on language model, pronunciation, and shape. In Proceedings of CIPS-SIGHAN, pages 220–223, Wuhan, China.

Liang-Chih Yu, Lung-Hao Lee, Yuen-Hsien Tseng, and Hsin-Hsi Chen. 2014. Overview of SIGHAN 2014 bake-off for Chinese spelling check. In Proceedings of CIPS-SIGHAN, pages 126–132, Wuhan, China.

Shaohua Zhang, Haoran Huang, Jicong Liu, and Hang Li. 2020. Spelling error correction with soft-masked BERT. In Proceedings of ACL, pages 882–890, Online.

Chenxi Zhu, Ziqiang Ying, Boyu Zhang, and Feng Mao. 2022. MDCSpell: A multi-task detector-corrector framework for Chinese spelling correction. In Findings of ACL, pages 1244–1253, Dublin, Ireland.

Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. 2024. LLaMA-MoE: Building mixture-of-experts from LLaMA with continual pre-training. ArXiv preprint, abs/2406.16554.
A Special Acknowledgements

We would like to extend our special thanks to all anonymous reviewers for their valuable comments and suggestions.

Reviewer gUCq highlighted unclear descriptions and missing experiments, such as the comparison with simpler LMs, in our initial version of the paper. The revisions made in response to these suggestions have significantly improved the quality of our work.

Reviewer B44m provided strong positive feedback on our work while also identifying missing details in the experimental setup, the absence of results on few-shot settings with GPT-4, and other aspects. Addressing these points made our paper more comprehensive and rigorous.

Reviewer cWnK raised concerns about the flexibility of introducing new knowledge. This insightful comment motivated us to further explore the topic and provide a simple solution in §5.5.

Reviewers cnvj, wY4T, and QXDJ gave our work high evaluations and provided numerous constructive comments and suggestions. Their recognition encourages us to continue refining our paper.

Corrected → Input (similar consonants)
j  → q, x, z
q  → j, x, c
x  → j, q, s
z  → j, c, s, zh
c  → q, z, s, ch
s  → z, c, sh
zh → z, ch, sh
ch → c, zh, sh
sh → s, zh, ch
r  → l
l  → r, n, d, t
n  → l, d, t
d  → l, n, t, b
t  → l, n, d, p
b  → d, p, m
p  → t, b
m  → b, p
g  → k, h
k  → g, h
h  → g, k, f
f  → h

Table 9: Consonants with similar pronunciation.
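The similar-initial pairs recoverable from Table 9 can be stored as a simple lookup, sketched below; the paper's rule-based tool additionally covers finals and character shapes, which are not reproduced here.

```python
# A subset of the similar-consonant pairs from Table 9.
SIMILAR_INITIALS = {
    "zh": {"z", "ch", "sh"}, "ch": {"c", "zh", "sh"}, "sh": {"s", "zh", "ch"},
    "z": {"j", "c", "s", "zh"}, "c": {"q", "z", "s", "ch"}, "s": {"z", "c", "sh"},
    "l": {"r", "n", "d", "t"}, "n": {"l", "d", "t"},
    "g": {"k", "h"}, "k": {"g", "h"}, "h": {"g", "k", "f"}, "f": {"h"},
}

def similar_initial(a: str, b: str) -> bool:
    """True if two pinyin initials are identical or listed as similar."""
    return a == b or b in SIMILAR_INITIALS.get(a, set()) or a in SIMILAR_INITIALS.get(b, set())

print(similar_initial("zh", "z"))  # True
print(similar_initial("b", "g"))   # False
```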
Table 11: The statistics of the datasets used in the experiments. Recall Upper Bound represents the sentence-level upper bound of the recall under the distortion model that we use in this work.

(2023b), which has manually verified and corrected the errors in the original dataset.

• CSCD-NS: A real-world Chinese social media corpus collected and annotated by Hu et al. (2024). It can better represent the variety of texts found in real-world settings and includes a broad spectrum of errors.

• MCSCSet: A large-scale corpus from the medical domain, collected and annotated by Jiang et al. (2022). It features numerous errors specific to medical terminology, making it an excellent resource for evaluating models' generalization capabilities in this area.

• ECSpell: A small-scale, multi-domain corpus annotated by Lv et al. (2023). It encompasses three domains: legal documents, medical treatments, and official document writing.

• Lemon: The most recent and largest multi-domain corpus to date, collected and annotated by Wu et al. (2023). It spans seven domains: medicine, encyclopedia, gaming, automotive, contracts, news, and novels. The original dataset also includes Sighan 15 as a subset, which we have considered as a part of the Sighan series and excluded from Lemon.

The detailed statistics of these datasets are shown in Table 11.

The recall upper bound in the statistics is obtained by calculating the number of sentences that can potentially be fully corrected out of the total number of sentences in the dataset. A sentence has the potential to be fully corrected if all the distortion types between each pair of source and target characters can be categorized into Identical, Same Pinyin, Similar Pinyin, and Similar Shape.

C.2 Implementation Details of Prompt-based Method

In this work, we use the prompt-based method to activate the CSC ability of the baseline LLMs. The task-specific instructions are adopted from Li et al. (2023a). The prompts used for the baselines are shown in Figure 4. We disable the sampling mechanism and set the temperature to 0.0 to ensure deterministic decoding. For few-shot prompting methods, where the example selection strategy involves random selection, we conduct three runs and report the average results. The only exception is the GPT4 model, which we run only once due to the high cost of using the model.

Figure 4: System and user prompts for the baselines.
System Prompt: 你是一个优秀的中文拼写纠错模型,中文拼写纠错模型即更正用户输入句子中的拼写错误。 (You are an excellent Chinese spelling correction model; a Chinese spelling correction model corrects the spelling errors in the user's input sentence.)
User Prompt: 你需要识别并纠正用户输入的句子中可能的错别字并输出正确的句子,纠正时必须保证改动前后句子必须等长,在纠正错别字的同时尽可能减少对原句子的改动(不添加额外标点符号,不添加额外的字,不删除多余的字)。只输出没有错别字的句子,不要添加任何其他解释或说明。如果句子没有错别字,就直接输出和输入相同的句子。 (You need to identify and correct possible typos in the user's input sentence and output the correct sentence. The corrected sentence must be exactly as long as the input, and the changes to the original sentence should be minimized (do not add extra punctuation, do not add extra characters, and do not delete characters). Output only the sentence without typos, without any additional explanation. If the sentence contains no typos, output the sentence exactly as given.)
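A sketch of how the ZSP baseline can call a GPT model with the Figure 4 prompts and deterministic decoding. How the input sentence is appended to the user prompt is an assumption, and the OpenAI client is shown only because the GPT baselines are closed models.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = "你是一个优秀的中文拼写纠错模型,中文拼写纠错模型即更正用户输入句子中的拼写错误。"
USER_TEMPLATE = (
    "你需要识别并纠正用户输入的句子中可能的错别字并输出正确的句子,"
    "纠正时必须保证改动前后句子必须等长,在纠正错别字的同时尽可能减少对原句子的改动"
    "(不添加额外标点符号,不添加额外的字,不删除多余的字)。"
    "只输出没有错别字的句子,不要添加任何其他解释或说明。"
    "如果句子没有错别字,就直接输出和输入相同的句子。\n{sentence}"
)

def zsp_correct(sentence: str, model: str = "gpt-3.5-turbo-0125") -> str:
    """Zero-shot prompting baseline with temperature 0.0 (deterministic decoding)."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(sentence=sentence)},
        ],
    )
    return resp.choices[0].message.content.strip()
```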
Model            Version   Strategy
Baichuan2 13B    Base      Similarity
Qwen1.5 14B      Base      Balanced
InternLM2 20B    Chat      Similarity
GPT3.5           –         Balanced
GPT4             –         Balanced

Table 12: The model version and example selection strategy used for the few-shot baselines.

C.3 Few-shot Example Selection Strategy for Baselines

Li et al. (2023a) proposed three selection strategies for CSC few-shot prompting methods: 1) Random: randomly select m examples; 2) Balanced: randomly select m examples with a balanced distribution of correct and erroneous examples; 3) Similarity: select the m most similar in-context examples for each input sentence using the BM25 and Rouge similarity metrics.

They found that the performance of few-shot prompting depends on the selection of in-context examples. Different selection strategies may lead to distinct results. Among the three strategies, Similarity was found to be the most effective.

However, the Similarity strategy is not always the optimal choice. In preliminary experiments, we observed that this strategy sometimes causes GPT family models to perform worse than the zero-shot prompting method. Upon analyzing the results, we found that GPT models are particularly sensitive to discrepancies in the proportion of erroneous sentences between the few-shot prompting examples and the target data. The examples selected using the Similarity strategy tend to have a similar proportion of erroneous sentences as the dataset used for selection. In our work, we use the Pseudo-Dev dataset to select few-shot prompting examples, which contains a higher proportion of erroneous sentences (87%–94%) compared to the target data (50%–56%). This discrepancy causes the GPT models to be more aggressive in correcting errors.

To ensure the effectiveness of the few-shot prompting method, we conducted experiments to determine the optimal strategy for each LLM we used. For open-source LLMs, which include both 'Base' and 'Chat' versions, we experimented with both versions and selected the best one for each LLM. The final choice of selection strategy is shown in Table 12.
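A sketch of the Balanced strategy, under the assumption that Pseudo-Dev is available as a list of (source, target) pairs; the Random and Similarity strategies differ only in how the m examples are picked.

```python
import random

def select_balanced(pseudo_dev, m=10, seed=0):
    """'Balanced' few-shot selection: sample m in-context examples with an even
    split of erroneous and already-correct sentence pairs."""
    rng = random.Random(seed)
    wrong = [pair for pair in pseudo_dev if pair[0] != pair[1]]
    right = [pair for pair in pseudo_dev if pair[0] == pair[1]]
    picked = rng.sample(wrong, m // 2) + rng.sample(right, m - m // 2)
    rng.shuffle(picked)
    return picked
```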
C.4 Pre- & Post-processing for Baselines

In this study, we employ several pre- and post-processing techniques to mitigate the errors introduced by the limitations of the baseline systems. This ensures a fair comparison between our approach and the baselines.
Datasets                  | rSighans           | ECSpell            | Lemon
Subsets                   | Y13   Y14   Y15    | Law   Med   Odw    | Car   Cot   Enc   Gam   Med   New   Nov
Domain-Specific SOTAs (trained on in-domain gold-standard data of each dataset)
ReaLiSe                     70.1  64.0  73.9     38.9  23.1  42.8     32.5  40.1  29.1  12.6  31.8  31.2  20.2
Liu et al. (2024)           –     –     –        91.2  82.4  83.6     –     –     –     –     –     –     –
Domain-General SOTAs (trained on about 34M synthetic CSC data)
Finetuned BERT              50.6  40.4  51.6     58.5  47.8  65.1     52.0  63.1  45.3  32.8  50.7  56.1  35.8
Softmasked BERT             51.6  40.2  51.3     58.5  48.5  65.9     52.3  63.8  44.1  28.3  48.9  55.6  37.7
ReLM                        45.8  40.6  55.5     60.4  50.9  66.5     53.3  66.7  47.7  33.7  53.8  58.8  37.1
LLMs (without CSC-specific training)
Baichuan2 (13B)  ZSP        26.4  12.0  18.5     37.6  23.0  43.0     15.3  14.9  24.0  12.7  21.6  19.8  14.1
Baichuan2 (13B)  FSP        41.1  23.1  31.3     60.2  50.4  60.0     32.2  45.3  38.9  24.6  39.0  39.7  26.4
Baichuan2 (13B)  OUR        63.6  54.1  59.6     82.6  78.9  92.0     52.7  62.9  51.9  37.1  60.1  63.9  43.5
Qwen1.5 (14B)    ZSP        41.6  17.4  28.1     53.3  38.9  60.7     28.5  42.0  33.8  20.5  35.3  37.3  25.3
Qwen1.5 (14B)    FSP        45.9  25.4  31.6     61.4  49.1  66.5     35.0  47.6  43.4  27.9  38.6  38.7  29.2
Qwen1.5 (14B)    OUR        56.9  48.6  57.6     84.1  73.2  87.4     46.0  59.9  44.6  28.3  52.9  55.8  36.4
InternLM2 (20B)  ZSP        42.3  20.9  29.7     47.7  31.9  55.9     29.8  42.6  34.3  21.2  40.0  34.7  27.2
InternLM2 (20B)  FSP        55.9  27.7  32.9     45.9  38.2  65.3     31.3  46.7  37.1  25.4  43.4  37.9  29.3
InternLM2 (20B)  OUR        57.8  53.1  60.5     83.9  72.3  91.1     49.7  59.0  48.2  31.8  55.9  63.3  40.5

Table 13: Detailed sentence-level correction F1 (S-F) results of each sub-domain.
BERT-based baselines  Most current CSC models utilize BERT as the backbone. However, BERT presents challenges that can degrade performance during evaluation: 1) Full-width Punctuation: BERT's tokenization process may normalize full-width punctuation to half-width, leading to numerous unnecessary punctuation replacements. To counter this, we prevent the model from modifying the original punctuation; 2) Special Tokens: BERT-based models may predict a special '[UNK]' token in some cases, resulting in the removal of the original character. In these instances, we retain the original character when a special token is predicted; 3) Input Length Limitation: BERT-based models show limited generalization beyond their maximum training length. We truncate inputs to a maximum length of 128 characters and concatenate the remaining characters to the output.

LLM baselines  The outputs of LLMs sometimes fail to align with the evaluation, primarily due to their inadequate instruction-following capability. To address this, we apply specific rules for post-processing: 1) Redundant Phrases: We remove redundant phrases such as "修改后的句子是:" (The corrected sentence is:), identified through common patterns in the model output; 2) Redundant Punctuation: Many sentences in the dataset lack terminal periods, yet some models inappropriately add them. To prevent incorrect evaluations due to this discrepancy, we remove any added terminal period if the original sentence did not have one.
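A sketch of the LLM-output post-processing rules above; the exact set of redundant-phrase patterns used in the paper is not spelled out, so the two rules below are only illustrative.

```python
def postprocess_llm_output(source: str, output: str) -> str:
    """Strip a leading '修改后的句子是:'-style phrase and drop a terminal period
    that the model added when the source sentence had none."""
    for phrase in ("修改后的句子是:", "修改后的句子是:"):  # full- and half-width colon
        if output.startswith(phrase):
            output = output[len(phrase):].strip()
    if output.endswith("。") and not source.endswith("。"):
        output = output[:-1]
    return output

print(postprocess_llm_output("商务部前头,11月底完成",
                             "修改后的句子是:商务部牵头,11月底完成。"))
```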
Datasets                  | rSighans           | ECSpell            | Lemon
Subsets                   | Y13   Y14   Y15    | Law   Med   Odw    | Car   Cot   Enc   Gam   Med   New   Nov
Domain-Specific SOTAs (trained on in-domain gold-standard data of each dataset)
ReaLiSe                     13.0   9.6   7.7     10.6  18.6  11.8     20.9  13.4  20.8  22.5  16.5  16.7  22.6
Liu et al. (2024)           –     –     –         7.4   6.5   2.2     –     –     –     –     –     –     –
Domain-General SOTAs (trained on about 34M synthetic CSC data)
Finetuned BERT              21.7  16.5  12.5      4.9  11.3   2.9     12.3   8.3  13.9  22.5   8.3   9.4  17.3
Softmasked BERT             13.0  17.6  14.5      6.1  11.7   5.0     12.4   7.1  14.8  20.4   9.6  10.6  16.6
ReLM                         4.4  15.0   9.5      7.8  11.0   7.1     12.1   5.6  12.6  20.8   5.7   8.4  17.5
LLMs (without CSC-specific training)
Baichuan2 (13B)  ZSP        34.8  58.3  54.4     26.9  43.1  21.0     40.6  54.2  35.9  41.6  35.4  41.1  37.6
Baichuan2 (13B)  FSP        21.7  19.4  23.2      7.8   9.1   0.4      8.3   7.4  10.2  20.0   4.6   8.3   7.7
Baichuan2 (13B)  OUR         8.7  14.1   8.3      4.5   9.9   0.4      5.9   6.9   8.9  19.2   3.9   5.7  13.0
Qwen1.5 (14B)    ZSP        34.8  54.4  34.2      5.7  35.4   2.1     18.5  15.8  13.5  18.4  11.8  14.0  20.7
Qwen1.5 (14B)    FSP        15.9  30.9  31.7      5.3  11.6   0.8      8.9  12.7  10.1  14.7   9.5   7.8   5.5
Qwen1.5 (14B)    OUR        21.7  19.6  10.2      4.9  11.7   2.9     11.2   6.3  14.8  29.4   5.4  10.1  21.2
InternLM2 (20B)  ZSP        65.2  58.0  48.8     26.5  50.7  17.7     28.8  23.7  30.0  30.6  23.0  34.0  24.2
InternLM2 (20B)  FSP        21.7  39.8  33.6     13.9  30.7   2.5     18.2  12.3  18.1  23.7  10.0  22.4  16.1
InternLM2 (20B)  OUR        13.0  16.5   8.3      2.5  12.4   0.4      8.5   6.9  12.2  22.5   3.7   6.1  15.1

Table 14: Detailed sentence-level false positive rate (FPR) results of each sub-domain.
D Details of Evaluation

D.1 Evaluation Metrics

In this work, we use the following metrics to evaluate the performance of our approach and the baselines.

Sentence-level Correction F1 (S-F)  S-F consists of two parts: precision (S-P) and recall (S-R):

\[
\text{S-F} = 2 \times \frac{\text{S-P} \times \text{S-R}}{\text{S-P} + \text{S-R}}, \tag{10}
\]

where S-P represents the proportion of correctly corrected sentences among all sentences modified by the model, and S-R represents the proportion of correctly corrected sentences among all sentences that need to be corrected.

A sentence is considered correctly corrected if and only if all errors in the sentence are fixed and no new errors are introduced. This strict definition makes the sentence-level F1 score rigorous, but it also makes it vulnerable when the number of evaluation samples is limited, as in the ECSpell dataset, which contains only 500 sentences for each sub-domain, and it lacks the ability to detect subtle differences between models when evaluating on a dataset where a sentence may contain multiple errors.

Character-level Correction F1 (C-F)  Different from the sentence-level F1 score, the character-level F1 score focuses on the correctness of each character in the sentence. Similar to the sentence-level F1 score, the character-level F1 score also consists of two parts: precision (C-P) and recall (C-R). C-P is the proportion of correctly corrected characters among all characters modified by the model, and C-R is the proportion of correctly corrected characters among all characters that need to be corrected.

Conventional character-level metrics of CSC are based on point-wise evaluation, which falls short when models insert or delete characters, as it can inaccurately mark all subsequent characters as incorrect due to a single addition or deletion. To overcome this, we implement the Levenshtein algorithm to align the model output with the target sentence and calculate the character-level metrics based on the aligned results. This alignment-based method provides a more reasonable evaluation of character-level performance.
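The sentence-level definitions above translate directly into code; the sketch below computes S-P, S-R, S-F, and FPR from (source, prediction, target) triples.

```python
def sentence_level_metrics(sources, predictions, targets):
    """Sentence-level precision, recall, F1, and FPR as defined in Appendix D.1:
    a sentence counts as correctly corrected only if the prediction equals the
    target; FPR is the share of originally correct sentences that were modified."""
    modified = corrected = need = hit = correct_src = false_pos = 0
    for s, p, t in zip(sources, predictions, targets):
        if s != t:                      # sentence needs correction
            need += 1
            if p == t:
                hit += 1
        else:                           # sentence is already correct
            correct_src += 1
            if p != s:
                false_pos += 1
        if p != s:                      # sentence was modified by the model
            modified += 1
            if p == t:
                corrected += 1
    s_p = corrected / modified if modified else 0.0
    s_r = hit / need if need else 0.0
    s_f = 2 * s_p * s_r / (s_p + s_r) if s_p + s_r else 0.0
    fpr = false_pos / correct_src if correct_src else 0.0
    return {"S-P": s_p, "S-R": s_r, "S-F": s_f, "FPR": fpr}
```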
Input          商务部前头,11月底完成
Reference      商务部牵头,11月底完成
ReLM           商务部牵头,11月底完成
BC2 13B ZSP    商务部前面,11月底完成
BC2 13B FSP    商务部日前,11月底完成
BC2 13B OUR    商务部牵头,11月底完成

Input          虎珀酸索莉那新片主要功能是什么
Reference      琥珀酸索利那新片主要功能是什么
ReLM           琥珀酸索莉那新片主要功能是什么
BC2 13B ZSP    琥珀酸索利那新片主要功能是什么
BC2 13B FSP    虎珀酸索莉那新片主要功能是什么
BC2 13B OUR    琥珀酸索利那新片主要功能是什么

Table 16: Qualitative examples of our approach and the baselines. Corrections marked in blue are correct, while those in red are incorrect.

Sentence-level False Positive Rate (FPR)  Both the sentence-level F1 score and the character-level F1 score overlook the cases where the model introduces unnecessary modifications to a de-facto correct sentence. To fill this gap, the sentence-level False Positive Rate (FPR) is proposed to measure the proportion of sentences that are initially correct but modified by the model.

D.2 Evaluation Settings and Conventions

During evaluation, we remove all whitespaces and convert all full-width punctuation to half-width in the input and output sentences to guarantee a fair comparison.¹¹

When evaluating the Lemon dataset, we ignore all sentences where the input and output sentence lengths do not match, following the dataset's convention.

E More Results

E.1 Detailed Results

Due to the space limitation, we only present the average results of each dataset in the main text. The detailed results of each dataset are shown in Table 13, Table 14, and Table 15.

E.2 Qualitative Examples

We provide two qualitative examples to illustrate the performance of our approach in Table 16.

In the first case ("Led by the Ministry of Commerce, to be completed by the end of November"), the word "牵头" (qiāntóu, led by) is misspelled as "前头" (qiántóu, front) in the input sentence. Both the ZSP and FSP baselines mistakenly put their attention on the character "前" (front) and incorrectly correct "前头" to "前面" (front) and "日前" (a few days ago), respectively. Such corrections are not only implausible but also linguistically awkward. In contrast, the domain-general model ReLM and our approach successfully correct the misspelling.

In the second case ("What are the main functions of Solifenacin Succinate Tablets"), the name of the drug "琥珀酸索利那新片" (Solifenacin Succinate Tablets) is misspelled. To correct the misspelling, knowledge of the medical domain is required. In this case, the ReLM model fails to correct the misspelling, while the zero-shot prompting baseline and our approach successfully correct it. It is worth noting that the few-shot prompting baseline also fails to correct the misspelling, which indicates that the inclusion of inappropriate examples may lead to worse performance.

F More Discussions

F.1 Impact of the Pre-training Data

There are two main factors that differentiate LLMs from simpler LMs: the scale of pre-training data and the model size. The impact of model size on the performance of LLMs has been discussed in §5.1. In this subsection, we aim to investigate the impact of pre-training data on the performance of our approach.

We compare Qwen1.5, a recent LLM family, with GPT2, which also has a causal LM (decoder-only) architecture. The GPT2 model family partially overlaps in model size with the Qwen1.5 model family, but it was trained on a much smaller dataset, CLUECorpusSmall. The CLUECorpusSmall dataset contains only about 5 billion characters and has limited diversity in text sources, including only news, Wikipedia, forums, and comments.

As shown in Table 17, when the model sizes are similar, the Qwen1.5 model family outperforms the GPT2 model family on all three datasets. The largest performance gap is observed on the Lemon-Nov dataset, where a smaller 463M Qwen1.5 model even outperforms a larger 1.5B GPT2 model by 7.1% in the sentence-level correction F1 score. This is because the Lemon-Nov dataset contains texts from the novel domain, which is not included in the CLUECorpusSmall dataset. These results indicate that the scale and diversity of the pre-training data are crucial for the performance of our approach.

¹¹ BERT-based models often remove whitespaces during tokenization and may convert full-width punctuation to half-width when correcting spelling errors (e.g., ReLM).
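The evaluation conventions of Appendix D.2 can be approximated as below; NFKC normalization is used here as a stand-in for the full-width-to-half-width conversion and does not cover every punctuation mark (e.g., the ideographic full stop).

```python
import unicodedata

def normalize_for_eval(text: str) -> str:
    """Drop all whitespace and map full-width characters to half-width forms,
    approximating the evaluation conventions of Appendix D.2."""
    text = "".join(ch for ch in text if not ch.isspace())
    return unicodedata.normalize("NFKC", text)

print(normalize_for_eval("商务部牵头 ,11月底完成"))  # -> 商务部牵头,11月底完成
```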
System         Data Amount | rSighan 15         | Lemon Nov          | ECSpell Odw
                           | S-F↑  C-F↑  FPR↓   | S-F↑  C-F↑  FPR↓   | S-F↑  C-F↑  FPR↓
GPT2 1.5B      Small         56.6  64.4  10.4     26.1  31.8  31.4     82.8  85.8   5.5
Qwen1.5 463M   Large         56.3  63.5  10.0     33.2  40.2  22.2     84.7  89.9   3.8
Qwen1.5 1.8B   Large         58.3  65.3  10.3     35.6  42.3  19.9     90.3  92.8   1.7

Table 17: A brief comparison of the performance of LLMs of different sizes and pre-training data amounts on three datasets.

F.2 Comparison to the Supervised Fine-tuning Method

In this subsection, we compare our approach with the supervised fine-tuning method.

However, we did not fine-tune the LLMs ourselves, as fine-tuning an LLM on the 34M synthetic CSC data would be extremely time-consuming and computationally expensive. Additionally, the supervised fine-tuning method typically requires careful hyperparameter tuning to achieve the best performance, further increasing the computational cost. Instead, we leverage the findings from Li et al. (2023a), who fine-tuned the Baichuan2 7B and GPT2 models on the ECSpell dataset, and Hu et al. (2024), who fine-tuned the Baichuan2 7B model on the CSCD-NS dataset.

The results are shown in Table 18. Compared to the BERT-based models, the supervised fine-tuning method is less effective in improving the performance of causal LMs like GPT2 and recent LLMs such as Baichuan2.

Our training-free approach even outperforms the supervised fine-tuning counterpart on the Med and Odw sub-domains of the ECSpell dataset. This phenomenon can be attributed to the characteristics of the ECSpell dataset, which, as pointed out by Wu et al. (2023), contains a high proportion (more than 70%) of error-correction pairs that never appeared in the training data. The supervised fine-tuning method is not effective in handling these unseen error-correction pairs, whereas our approach can still correct them.

F.3 Influence of Beam Size

When searching for the most likely correction sequence, the beam search algorithm is used to avoid the exponential growth of the search space and the local minima caused by greedy search. Knowing the impact of the beam size on the performance helps researchers choose a proper beam size to balance the trade-off between the performance and the computational cost. The results are shown in Figure 5. Though a larger beam size consistently leads to better performance, the improvement becomes marginal when the beam size is larger than 6.
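A schematic of the beam search loop whose width is varied in this analysis; the candidate generator and the step scorer are stubs standing in for the confusion sets and the p_LLM/p_DM scores, and the length and faithfulness rewards of §5.3 are omitted for brevity.

```python
def beam_correct(x, candidates, step_score, beam_size=8):
    """Beam search over character positions (the main experiments use beam size 8).
    `candidates(ch)` yields possible replacements of an input character and
    `step_score(prefix, ch, x_ch)` returns an incremental log-probability."""
    beams = [("", 0.0)]
    for x_ch in x:
        expanded = [
            (prefix + ch, score + step_score(prefix, ch, x_ch))
            for prefix, score in beams
            for ch in candidates(x_ch)
        ]
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]
```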
[Figure 5: line plots of S-F and C-F against beam sizes from 2 to 12 on rSighan 15, Lemon Nov, and ECSpell Odw.]

Figure 5: The scores of Baichuan2 7B with different beam sizes. The solid lines represent the results of our approach, and the dashed lines represent the results of the few-shot baseline. We can observe that larger beam sizes may lead to worse C-F scores in few-shot settings.
F.4 Effectiveness of the Estimated Distortion Model

In our approach, we directly estimate the distortion probabilities from the statistics of the Pseudo-Dev dataset (§3.5). Obviously, this estimation will be different from the true probabilities.

To verify the effectiveness of the estimated distortion model, we conduct experiments comparing the estimated distortion model with the true distortion model. The results are presented in Table 19. The upper part of the table shows the difference between the estimated distortion model and the true distortion model. We can see that the estimated one is quite close to the true one, except for the Similar Shape distortion type. The lower part shows that the difference in performance is marginal, indicating that the estimated distortion model is sufficient for our approach to achieve good performance and has good generalization ability across different datasets.

F.5 Inference Speed

We conducted a brief runtime analysis to evaluate the inference speed of our approach. The analysis was performed using a single NVIDIA A100 40GB GPU with an Intel Xeon Gold 6248R (3.00GHz) CPU. The batch size was set to 1 for all models, and the other hyperparameters were set to the same values as in the main experiments.

The average inference speed of each model on the ECSpell-Odw dataset is shown in Table 20. Due to the large model size and the autoregressive decoding process, LLMs are significantly slower than the BERT-based ReLM model. Compared to the ZSP and FSP baselines, our approach is slower (1.71× and 1.45×, respectively), primarily due to our immature implementation of the distortion model, which can be further optimized to improve inference speed.

System               Inference Speed (ms)
                     per Sent.   per Char.
ReLM                      14.4         0.4
Baichuan2 13B  ZSP       899.8        22.2
Baichuan2 13B  FSP     1,057.4        26.1
Baichuan2 13B  OUR     1,541.0        38.0

Table 20: The inference speed of different models.

System                 Ctx.   S-F↑  C-F↑  FPR↓  CER↓  CERR↑
No correction           –      –     –     –    4.83    –
Domain-Specific SOTAs
Leng et al. (2021a)     ✗      –     –     –    4.16  13.9
Leng et al. (2021b)     ✗      –     –     –    4.11  14.9
Leng et al. (2023)      ✗      –     –     –    3.57  26.1
Domain-General SOTAs
Finetuned BERT          ✗     23.7  25.7   5.3  4.39   9.1
                        ✓     18.2  19.5   1.8  4.43   8.3
Softmasked BERT         ✗     22.6  25.5   5.4  4.43   8.3
                        ✓     19.8  21.4   1.9  4.39   9.1
ReLM                    ✗     24.7  27.5   4.7  4.30  11.0
                        ✓     17.7  18.6   2.5  4.50   6.8
LLMs
Baichuan2 (13B) OUR     ✗     34.8  43.1   3.8  3.68  23.8
                        ✓     41.7  51.3   3.0  3.29  31.9
Qwen1.5 (14B) OUR       ✗     28.7  37.4   7.1  4.10  15.1
                        ✓     38.0  48.5   4.4  3.44  28.8
InternLM2 (20B) OUR     ✗     33.8  42.6   4.1  3.70  23.4
                        ✓     40.4  51.3   3.0  3.29  31.9

Table 21: Results of contextually enhanced spelling correction on the AISHELL-1 dataset. ✓ denotes the results of models taking 3 preceding sentences as the input prefix. All the preceding context is also predicted by the same ASR model.

F.6 Context as New Knowledge

In Section 5.5, we used a toy example to demonstrate that our approach can introduce new knowledge into the LLM by merely modifying the input prefix. However, in real-world scenarios, it is difficult to automatically extract the key characters as we did in the toy example and ensure they are suitable for the input prefix. Luckily, sentences in real-world contexts are not isolated but are part
of a paragraph, and their preceding sentences can
provide valuable information for error correction.
Thus, we can treat the preceding context as new
knowledge and introduce it into the LLM.
Since existing datasets for CSC are composed of isolated sentences, it is impossible to validate the effectiveness of using the preceding context as new knowledge on them. Therefore, we utilize the ASR error correction dataset derived from AISHELL-1 (Bu et al., 2017), where the sentences are consecutive and part of coherent passages. In this dataset, Leng et al. (2021a) used an ASR model to transcribe the speech data, introducing spelling errors naturally caused by the ASR system.

In addition to the conventional CSC metrics, we also report the Character Error Rate (CER)¹² and the Character Error Rate Reduction (CERR)¹³ to compare with the baseline models (Leng et al., 2021a,b, 2023).

Specifically, we take the three preceding sentences from the source side as the new knowledge:
¹² CER calculates the number of insertion, deletion, and substitution edits required to transform the predicted sequence into the target sequence, normalized by the length of the target sequence.
¹³ CERR represents the percentage of CER reduction compared to the baseline model:

\[
\mathrm{CERR} = 1 - \frac{\mathrm{CER}_{\mathrm{ours}}}{\mathrm{CER}_{\mathrm{baseline}}}.
\]
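For reference, the two footnote metrics can be computed as below; normalizing the edit distance by the target length follows the standard CER convention.

```python
def char_error_rate(prediction: str, target: str) -> float:
    """CER (footnote 12): minimum number of insertion, deletion, and substitution
    edits between prediction and target, divided by the target length."""
    m, n = len(prediction), len(target)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1,
                        prev + (prediction[i - 1] != target[j - 1]))
            prev = cur
    return dp[n] / max(n, 1)

def cerr(cer_ours: float, cer_baseline: float) -> float:
    """CERR (footnote 13): relative CER reduction, 1 - CER_ours / CER_baseline."""
    return 1.0 - cer_ours / cer_baseline
```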