Table 2: Main Results. We reran the released code of ReaLiSe (Xu et al., 2021), along with their released models, to obtain the results. ReaLiSe was trained on the in-domain, gold-standard data of the Sighan datasets and represents a SOTA model for them. The numbers in gray represent the out-of-domain results for ReaLiSe. Detailed results of each sub-domain are provided in Appendix E.1.
correction F1 (C-F) and sentence-level false positive rate (FPR) to provide a more complete view of the model performance. Details of the evaluation metrics can be found in Appendix D.

3.3 Baselines

We compare our approach against the prompt-based method under two settings: zero-shot prompting (ZSP) and few-shot prompting (FSP). For the few-shot setting, we select 10 examples from the Pseudo-Dev dataset. The details of the prompts can be found in Appendix C.2, and the example selection strategy is described in Appendix C.3. During inference, we adopt the greedy decoding strategy.²

To provide a more comprehensive comparison, we also present results from state-of-the-art domain-general CSC models trained on 34 million pairs of synthetic CSC data for reference. These models include Finetuned BERT (Devlin et al., 2019), Softmasked BERT (Zhang et al., 2020), and ReLM (Liu et al., 2024).³

Additionally, for datasets that have in-domain manually annotated data, we report results from models specifically trained on it, serving as another reference point.

3.4 Selection of LLMs

We conduct experiments on three open-source LLMs: Baichuan2 (Yang et al., 2023a), Qwen1.5 (Bai et al., 2023), and InternLM2 (Cai et al., 2024). For the main results, we select models with parameter sizes ranging from 10B to 20B to ensure that the LLMs have sufficient zero-shot and few-shot capabilities for meaningful comparisons. Additionally, we report the ZSP and FSP results of the widely recognized best-performing LLM family, GPT, including GPT-3.5 and GPT-4.

To simplify the analysis, we select Baichuan2 7B as a representative model to investigate the impact of the components in our approach.

3.5 Hyperparameters of Our Approach

We use the "Base" version of each LLM family. The distortion probabilities of the distortion model were derived from the statistics of the Pseudo-Dev dataset. We tuned α on Baichuan2 7B using the Pseudo-Dev dataset; eventually, α was set to 2.5 for all experiments. During inference, we adopt beam search with a beam size of 8.

² We observe that, for the prompt-based baselines, the improvement of beam search is marginal and sometimes even detrimental.
³ The results of these models were obtained by running the released code along with the corresponding checkpoints provided at https://github.com/gingasan/lemon.git.
System           S-F↑  S-P↑  S-R↑  C-F↑  C-P↑  C-R↑  FPR↓
rSighan 15
ReLM             55.5  61.1  50.8  61.0  78.5  49.9   9.5
GPT3.5 ZSP       42.0  41.7  42.3  47.8  42.5  54.6  25.8
GPT3.5 FSP       41.7  42.0  41.4  48.4  44.5  53.2  23.4
GPT4 ZSP         43.5  38.1  50.8  49.9  40.2  66.0  47.5
GPT4 FSP         48.7  44.2  54.4  52.9  44.0  66.3  38.8
BC2 13B (OUR)    59.6  66.5  54.0  67.3  78.3  59.0   8.3
Q1.5 14B (OUR)   57.6  62.5  53.4  66.0  74.1  59.4  10.2
IL2 20B (OUR)    60.5  67.2  55.0  67.8  78.7  59.6   8.3
Lemon Nov (1000)
ReLM             36.4  46.7  29.8  36.0  49.2  28.3  14.3
GPT3.5 ZSP       19.1  20.8  17.7  19.8  17.9  22.3  29.2
GPT3.5 FSP       25.5  31.4  21.4  24.9  27.2  23.0  19.6
GPT4 ZSP         30.6  28.4  33.1  33.4  26.9  44.1  33.5
GPT4 FSP         42.7  41.4  44.0  43.1  38.9  48.3  27.4
BC2 13B (OUR)    45.3  53.7  39.1  49.1  57.0  43.2  13.1
Q1.5 14B (OUR)   38.2  41.7  35.3  43.7  44.5  43.0  21.8
IL2 20B (OUR)    42.8  49.9  37.5  46.4  52.8  41.4  15.3
ECSpell Odw
ReLM             66.5  67.5  65.6  73.0  86.4  63.1   7.1
GPT3.5 ZSP       58.2  62.5  54.5  61.0  62.7  59.3   4.6
GPT3.5 FSP       59.3  64.1  55.2  60.7  62.4  59.0   2.4
GPT4 ZSP         73.1  73.0  73.3  77.3  75.5  79.2   5.0
GPT4 FSP         73.2  73.5  72.9  78.5  78.3  78.7   5.0
BC2 13B (OUR)    92.0  94.4  89.7  93.8  95.6  92.1   0.4
Q1.5 14B (OUR)   87.4  88.6  86.3  91.6  91.8  91.3   2.9
IL2 20B (OUR)    91.1  92.9  89.3  93.8  95.9  91.8   0.4

Table 3: The comparison to the GPT family on the rSighan 15, Lemon Nov, and ECSpell Odw datasets. The version of GPT3.5 is 'gpt-3.5-turbo-0125' and that of GPT4 is 'gpt-4-0613'. BC2 is short for Baichuan2, Q1.5 for Qwen1.5, and IL2 for InternLM2.

System     | rSighan 15         | Lemon Nov          | ECSpell Odw
           | S-F↑  C-F↑  FPR↓   | S-F↑  C-F↑  FPR↓   | S-F↑  C-F↑  FPR↓
BC2  7B      59.8  68.2   8.0     43.2  47.7  13.6     89.7  93.0   1.3
BC2  13B     59.6  67.3   8.3     43.5  47.9  13.0     92.0  93.8   0.4
Q1.5 0.5B    56.3  63.5  10.0     33.2  40.2  22.2     84.7  89.9   3.8
Q1.5 1.8B    58.3  65.3  10.3     35.6  42.3  19.9     90.3  92.8   1.7
Q1.5 4B      58.4  66.8  10.0     35.9  42.3  21.1     88.4  91.1   3.4
Q1.5 7B      59.4  67.0   8.5     39.0  44.7  19.0     87.1  91.4   3.4
Q1.5 14B     57.6  66.0  10.2     36.4  42.6  21.2     87.4  91.6   2.9
Q1.5 32B     57.2  65.8  10.0     36.6  42.2  19.4     88.2  91.9   2.9
IL2  1.8B    55.3  64.0  12.2     33.2  40.1  22.6     88.3  91.0   2.1
IL2  7B      58.1  65.5  10.2     38.8  44.2  18.0     89.3  92.0   2.1
IL2  20B     60.5  67.8   8.3     40.5  45.3  15.1     91.1  93.8   0.4

Table 4: Ablation results of model size.

4 Main Results

We present the main results in Table 2 and the comparison to the GPT family in Table 3. Conducting a comprehensive evaluation of the GPT family is expensive, so we limit the comparison to a small-scale study, focusing on the three datasets mentioned in Section 3.1.⁴ Moreover, several qualitative examples are provided in Appendix E.2 to illustrate the performance of our approach.

After applying our approach, all three LLM families outperform their prompt-based counterparts on all five datasets by a large margin.

Compared to the recent state-of-the-art domain-general CSC models, which are trained on 34M synthetic CSC data, our approach also achieves competitive or even superior performance on most datasets, especially on the MCSCSet and ECSpell datasets. The results indicate that our approach has better generalization across different domains and genres than the current domain-general SOTAs. However, our approach still largely lags behind the domain-specific SOTAs trained on the gold-standard labeled data of each dataset.

Compared to the GPT family, our approach consistently outperforms GPT3.5 on all three datasets and achieves better performance than GPT4 in most cases. However, our approach may exhibit a lower C-R compared to GPT4, indicating that we might miss some errors that GPT4 can correct.

5 Discussion

5.1 Impact of the Size of the LLM

First, we investigate the impact of the LLM size on the performance of our approach.

As shown in Table 4, in general, larger LLMs tend to perform better than smaller ones within the same model family. However, the Qwen1.5 model family is an exception: the performance improvement becomes marginal when the model size exceeds 1.8B parameters and even decreases when the model size reaches 7B.

When comparing the performance of models of the same size across different model families, we find that the Baichuan2 family generally outperforms the other two model families.

5.2 Effectiveness of the Distortion Model

To investigate the effectiveness of the minimal distortion model, we first remove the distortion model p_DM(x | y) from the decoding process. Alternatively, we adopt a constrained text generation (CTG) approach to correct the input sentence. For each step, we limit the vocabulary to tokens that are related to the corresponding characters in the input sentence,⁵ and let the model select the most likely token from the constrained vocabulary. The results are shown in the "CTG" rows of Table 5.

⁴ The original Lemon-Nov dataset includes 6,000 sentences, which is excessively large for our scope. Therefore, we selected the first 1,000 sentences for this comparison.
⁵ Classified as Identical, Same Pinyin, Similar Pinyin, or Similar Shape.
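A minimal sketch of the CTG baseline described above, with stub callables standing in for the LLM's next-character distribution and for the related-character lookup; the actual baseline constrains the LLM's token vocabulary directly.

```python
from typing import Callable, Dict, Set

def ctg_correct(
    x: str,
    related: Callable[[str], Set[str]],                # characters related to an input character
    next_logprobs: Callable[[str], Dict[str, float]],  # stub for log p_LLM(next char | prefix)
) -> str:
    """Constrained text generation: at step i the vocabulary is limited to x[i]
    and its related characters, and the most likely character is chosen greedily.
    No distortion model is involved, which is why this baseline tends to replace
    characters with frequent look-alikes (Table 5, "CTG" rows)."""
    out = ""
    for x_char in x:
        allowed = {x_char} | related(x_char)
        logp = next_logprobs(out)
        out += max(allowed, key=lambda c: logp.get(c, float("-inf")))
    return out
```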
System    S-F↑   S-P↑   S-R↑   C-F↑   C-P↑   C-R↑   FPR↓
rSighan 15
CTG        6.7    5.3    9.1    7.7    4.2   47.7   90.0
OUR       59.8   66.0   54.7   68.2   77.8   60.6    8.0
- DT      -7.7  -12.6   -3.9   -7.1  -15.7   -0.3   +9.4
- DT*    -12.3  -18.2   -7.5   -9.8  -20.5   -1.2  +11.1
Lemon Nov
CTG        0.7    0.5    1.1    1.4    0.7   22.5   96.2
OUR       43.2   52.2   36.9   47.7   55.5   41.9   13.6
- DT     -12.3  -20.5   -6.8  -10.0  -20.8   -0.7  +13.9
- DT*    -11.7  -20.5   -5.5   -9.7  -21.8   -1.6  +14.7
ECSpell Odw
CTG       29.3   24.5   36.3   21.4   12.4   79.5   52.9
OUR       89.7   91.6   87.8   93.0   95.3   90.8    1.3
- DT      -4.0   -4.6   -3.4   -3.9   -5.8   -2.2    0.0
- DT*    -16.3  -16.9  -15.7  -12.7  -14.5  -10.9   +2.5

Table 5: Ablation results of the distortion model on Baichuan2 7B. "CTG" means constrained text generation. "- DT" means that we do not distinguish Same Pinyin, Similar Pinyin, and Similar Shape, and treat them as a single Related distortion. "- DT*" means using the confusion set from Wang et al. (2018) to identify the Related distortion.

System    S-F↑   S-P↑   S-R↑   C-F↑   C-P↑   C-R↑   FPR↓
rSighan 15
Vanilla   18.0   15.9   20.6   20.7   14.3   37.6   52.9
w/ LR    +39.4  +43.4  +35.0  +43.7  +53.3  +23.9  -38.4
w/ FR     +3.8   +6.2   +0.8   +5.4   +8.3   -6.6  -19.3
w/ Both  +41.9  +50.1  +34.1  +47.4  +63.5  +23.0  -44.8
Lemon Nov
Vanilla   19.4   18.0   20.9   23.6   17.1   38.3   38.5
w/ LR    +17.1  +19.5  +14.6  +19.0  +21.9   +8.6  -13.7
w/ FR     +9.0  +13.5   +4.7   +8.5  +13.5   -4.5  -18.8
w/ Both  +23.9  +34.2  +16.0  +24.1  +38.4   +3.6  -25.0
ECSpell Odw
Vanilla   65.3   65.3   65.3   70.4   65.4   76.2   10.1
w/ LR    +25.4  +26.9  +24.0  +22.5  +28.5  +15.6   -9.7
w/ FR     +4.7  +11.2   -0.8   +7.5  +19.7   -4.5   -6.7
w/ Both  +24.4  +26.4  +22.5  +22.6  +29.9  +14.6   -8.8

Table 6: Ablation results of Baichuan2 7B. "LR" and "FR" represent "length reward" and "faithfulness reward", respectively. "Both" means using both the length reward and the faithfulness reward.

We can see that CTG performs poorly on all datasets. This is because a Chinese character may have many similar characters. Without the distortion model, the model is prone to replacing the original character with a higher-frequency similar character, leading to a large number of errors.

Next, we investigate the impact of the distortion type by treating the three types of related but not identical distortions as a single distortion type. As shown in the "- DT" rows of Table 5, the performance drops significantly, but not as severely as when removing the distortion model. This performance drop is mainly due to a decrease in precision.

We also examine the effectiveness of our rule-based tool for identifying related distortions. We replace our rule-based tool with the confusion set from Wang et al. (2018) to identify the related distortion. The results in the "- DT*" rows of Table 5 show that the confusion set from Wang et al. (2018) is less effective than our rule-based tool, leading to more severe performance degradation.

5.3 Impact of Two Rewards

In this work, we propose two rewards to optimize the decoding process: the length reward and the faithfulness reward. The ablation study results of the two rewards are shown in Table 6.

The results show that the length reward significantly improves performance on all three datasets. This improvement can be attributed to increases in both precision and recall, indicating that the length reward is crucial to our approach. The faithfulness reward mainly contributes to improving precision, and it may slightly reduce recall. Overall, the faithfulness reward balances the trade-off between precision and recall, leading to a higher F1 score. The combination of the two rewards achieves better performance than using them separately, especially when datasets contain less formal text, more colloquial expressions, and more diverse named entities.

5.4 Does Our Approach Work Well on Simpler LMs?

Though our primary focus is on the performance of our approach on LLMs, the language model term of Equation 1 can be substituted with simpler models, such as n-gram models, masked language models, or small-scale causal language models. In this subsection, we investigate the performance of our approach using these simpler language models.

The LMs we investigate include: n-gram LM: KLM,⁶ a 5-gram language model trained on the Chinese Gigaword corpus; Masked LM: BERT,⁷ a bidirectional language model pre-trained with the mask filling task and the next sentence prediction task; Small causal LM: GPT2,⁸ a small-scale causal language model (about 102M parameters) trained on CLUECorpusSmall (about 5B characters).

⁶ shibing624/chinese-kenlm-klm
⁷ bert-base-chinese
⁸ uer/gpt2-chinese-cluecorpussmall
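As a sketch of swapping in the n-gram LM above, the snippet below scores a candidate with KenLM. The model path is a hypothetical local copy of the shibing624/chinese-kenlm-klm model, and characters are space-separated because KenLM scores whitespace-delimited tokens; this score would simply replace the LLM term of Equation 1.

```python
import kenlm

KLM_PATH = "chinese_5gram.klm"  # hypothetical local path to the KenLM model
klm = kenlm.Model(KLM_PATH)

def ngram_logprob(candidate: str) -> float:
    """log10 p_LM(y) from the 5-gram model, usable in place of the LLM term
    when reproducing the simpler-LM setting of Table 7."""
    return klm.score(" ".join(candidate), bos=True, eos=True)

print(ngram_logprob("商务部牵头"))
```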
System     | rSighan 15         | Lemon Nov          | ECSpell Odw
           | S-F↑  C-F↑  FPR↓   | S-F↑  C-F↑  FPR↓   | S-F↑  C-F↑  FPR↓
BC2 13B      59.6  67.3   8.3     43.5  47.9  13.0     92.0  93.8   0.4
Q1.5 14B     57.6  66.0  10.2     36.4  42.6  21.2     87.4  91.6   2.9
IL2 20B      60.5  67.8   8.3     40.5  45.3  15.1     91.1  93.8   0.4
KLM          29.3  38.9  33.8      5.8   9.4  65.8     58.3  65.3  23.5
BERT 110M    31.3  34.0   0.2     13.3  12.5   0.6     59.1  63.6   0.0
GPT2 102M    55.0  64.7   8.1     26.1  30.8  28.4     78.6  85.0   5.4

Table 7: Results of applying our approach to simpler LMs.

System                    S-F↑  C-F↑  FPR↓
Finetuned BERT    ORI     35.3  48.5   7.5
                  w/ k    +0.7  +1.7  +0.1
Softmasked BERT   ORI     35.3  48.5   8.1
                  w/ k    +1.2  +2.1  -0.5
ReLM              ORI     37.8  50.2   6.8
                  w/ k    +0.9  +1.9  -0.2
Baichuan2 13B     OUR     66.0  76.9   1.7
                  w/ k    +5.1  +5.4  -0.2
Qwen1.5 14B       OUR     61.1  72.6   3.1
                  w/ k    +9.1  +8.8  -1.0
InternLM2 20B     OUR     63.2  72.9   2.6
                  w/ k    +4.8  +5.4  -0.0

Table 8: The results of introducing new knowledge by adding a prefix k to the input on the MCSCSet. "ORI" denotes the original input without any prefix.

The results are shown in Table 7. From these results, we can see that our approach also works with simpler LMs. On the ECSpell-Odw dataset, our approach enables simpler language models (LMs) to achieve sentence- and character-level correction F1 scores higher than 50% and 60%, respectively. However, the performance of our approach on simpler LMs still lags significantly behind that of the large language models (LLMs), highlighting the importance of the scale of pre-training data and model size.

5.5 How to Introduce New Knowledge into Our Approach?

The LLM part of our approach offers a straightforward way to incorporate new knowledge without the need for further training: adding text that describes the new knowledge as an input prefix.

Given the new knowledge k, Equation 1 can be adjusted from p(x, y) to p(x, y | k). We then have:

\[
p(x, y \mid k) = p(x \mid y, k)\, p(y \mid k) \approx p_{\mathrm{DM}}(x \mid y)\, p_{\mathrm{LLM}}(y \mid k), \tag{8}
\]

where, by assuming that x and k are conditionally independent given y, we approximate p(x | y, k) as p_DM(x | y). The second term, p_LLM(y | k), can be calculated by the LLM using the input prefix k.

To illustrate this point, we conducted a simple experiment introducing domain and text-format information as new knowledge into our approach. We chose the MCSCSet for this experiment, as the sentences in it share a common characteristic: they are questions from patients. We can introduce this knowledge into the LLM by adding a simple input prefix k = "患者提问:" ("A patient asks:").

The results in Table 8 demonstrate that introducing new knowledge into the LLM by merely modifying the input prefix can significantly improve the model's performance on the CSC task.

We provide a real case from the MCSCSet to explain why this method works. Consider the sentence "未挨前兆" (wèi āi qián zhào, "without being near any prior warnings"), which should be corrected to "胃癌前兆" (wèi ái qián zhào, "early symptoms of stomach cancer") in the medical domain. This sentence contains only four characters, insufficient to provide enough context for accurate correction, even for humans. CSC models fail to correct this sentence or suggest incorrect corrections, such as "未提前兆" (wèi tí qián zhào, "did not provide prior warnings") or "未按前兆" (wèi àn qián zhào, "not according to the prior warnings"). However, if we add the prefix "患者提问:" ("A patient asks:"), which provides the knowledge that the sentence is a patient's question about a medical condition, the model can correctly predict "胃癌前兆".

In addition to this simple experiment, we also provide an experiment in Appendix F.6 to show that we can use the context as new knowledge to improve the performance of the CSC model in real-world applications.

5.6 More Discussions

Due to space constraints, some interesting discussions have been moved to the Appendix. These include: a discussion on how the pre-training data of the LLM affects the performance (F.1); a comparison between our approach and the SFT approach (F.2); an analysis of the influence of beam size on the performance (F.3); an exploration of whether the imperfect estimation of the distortion model impacts the performance (F.4); and a brief runtime analysis (F.5).
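A sketch of how the second term of Equation 8, p_LLM(y | k), can be computed with an off-the-shelf causal LM. The small GPT2 model from Appendix F.1 is used here only to keep the example light (the main results use Baichuan2/Qwen1.5/InternLM2 "Base" models), and the tokenization handling is an assumption rather than the paper's exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "uer/gpt2-chinese-cluecorpussmall"  # small causal LM from Appendix F.1
tok = AutoTokenizer.from_pretrained(MODEL_ID)
lm = AutoModelForCausalLM.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def logp_y_given_k(y: str, k: str = "") -> float:
    """Approximate log p_LLM(y | k) by conditioning the LM on the knowledge
    prefix k and summing the log-probabilities of y's tokens only."""
    k_ids = tok(k, add_special_tokens=False).input_ids
    y_ids = tok(y, add_special_tokens=False).input_ids
    ids = torch.tensor([k_ids + y_ids])
    logprobs = torch.log_softmax(lm(ids).logits[0, :-1], dim=-1)
    total, start = 0.0, max(len(k_ids), 1)  # the first token is unscored when k is empty
    for pos in range(start, ids.shape[1]):
        total += logprobs[pos - 1, ids[0, pos]].item()
    return total

# The Section 5.5 example: the medical prefix should make the gold correction more likely.
print(logp_y_given_k("胃癌前兆", k="患者提问:"))
print(logp_y_given_k("胃癌前兆"))
```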
6 Related Works

6.1 Chinese Spelling Check

Previous research on the CSC task can be divided into three eras, accompanied by paradigm shifts.

The Early Unsupervised Era  Early CSC approaches mainly utilized unsupervised pipeline systems (Yeh et al., 2013; Yu et al., 2014; Yu and Li, 2014; Huang et al., 2014; Xie et al., 2015). These systems typically operate in three main steps: error detection, candidate correction generation from a confusion set, and candidate ranking using a statistical n-gram language model.

The Supervised Learning Era  By 2018, the advent of techniques for automatically generating pseudo-labeled data had begun to address the challenge of data scarcity in CSC (Wang et al., 2018), marking a shift in the paradigm of CSC research towards a supervised learning era dominated by deep neural networks. This era saw researchers exploring various avenues to enhance CSC performance. Some focused on finding better model architectures (Zhang et al., 2020; Zhu et al., 2022), while others delved into more effective training strategies (Liu et al., 2022; Wu et al., 2023; Liu et al., 2024). Additionally, there was an effort to enrich models with information beyond text, such as phonetic or visual features (Cheng et al., 2020; Xu et al., 2021; Li et al., 2022; Liang et al., 2023).

Similar to our work, Wu et al. (2023) also decomposed p(x | y) into two parts to improve CSC performance. However, they achieved this by adding an auxiliary training loss. Our work stands out by using an off-the-shelf LLM as the backbone and a minimal distortion model to achieve good CSC performance without any additional training.

The Era of LLMs  Our work represents an initial foray into what can be considered the third era of CSC research: the era of LLMs. This phase explores the potential of LLMs in addressing the CSC task. As discussed in the introduction, related studies in this era fall into two main categories: prompt-based and supervised fine-tuning. Li et al. (2023a) were the first to investigate the prompt-based approach under various settings. Building on this work, Dong et al. (2024) proposed enriching prompts with additional information, such as pronunciation and character glyphs. Compared to the prompt-based approach, the SFT-based approach has been shown to be more effective (Li et al., 2023a). However, the performance of SFT-based LLMs still falls significantly behind pre-LLM methods. Li et al. (2024) argue that this underperformance is due to the mixed character-word tokenization used by LLMs for Chinese text. To address this issue, they suggest replacing mixed tokenization with character-level tokenization before training LLMs on the CSC dataset.

In contrast to these methods, our approach requires neither prompts nor additional training.

6.2 Decoding Methods of LLMs

Intervening in the decoding process is a common approach to improving LLMs' task-specific performance. There are two popular approaches in this category: contrastive decoding and constrained decoding. Contrastive decoding (Li et al., 2023b) refines the output probabilities by comparing the output probabilities of expert and amateur models (O'Brien and Lewis, 2023; Shi et al., 2023). Constrained decoding, on the other hand, uses constraints to guide the decoding process, making the output more aligned with the task-specific requirements (Wang et al., 2023; Geng et al., 2023).

Our work is closely related to the constrained decoding approaches, in that a distortion model is used to influence the LLM decoding process.

7 Conclusion

In this work, we propose a simple, training-free, and prompt-free approach to leveraging LLMs for the CSC task. Two components, a large language model and a minimal distortion model, cooperate to correct spelling errors. We alleviate the local optima problem and the over-correction issue with two simple strategies, the length reward and the faithfulness reward, respectively. Our comprehensive experiments have shown that our approach significantly improves LLM performance. Through our approach, LLMs demonstrate remarkable domain generalization capabilities, surpassing SOTA domain-general CSC models, which are trained on extensive synthetic CSC data, on most datasets.
Limitations

Feasibility  The scope of this study is limited to the task of Chinese spelling correction, which is a subset of text error correction. Most of our design choices are tailored to the characteristics of Chinese and the specific requirements of the CSC task.

However, our approach has the potential to be directly applied to some other languages. For example, in Japanese and Korean, we can also categorize errors into phonetic similarities, such as (や, ya)-(な, na) in Japanese or (후, hu)-(부, bu) in Korean, and shape similarities, like (ュ, yu)-(ェ, e) in Japanese. For languages using a phonetic writing system, like English, minor adjustments such as adding INSERT, DELETE, and REORDER operations will be sufficient to make it work.

Comparatively, handling complex text errors that involve grammar, semantics, or pragmatics is more challenging. To tackle these errors, one could design an appropriate distortion model, though it might necessitate the adoption of more intricate rules or the implementation of a model based on neural networks. In our future work, we aim to explore ways that would allow our approach to handle these complex errors.

Computational Cost  Our approach requires the use of LLMs, which introduces additional computational costs. However, many existing techniques, such as quantization (Frantar et al., 2022; Lin et al., 2024), pruning (Ma et al., 2023; Zhu et al., 2024), distillation (Hsieh et al., 2023), and more efficient framework implementations (Dao, 2023; Yang et al., 2024), can be directly applied to our method to reduce these costs.

Acknowledgements

First and foremost, we would like to express our deepest gratitude to all anonymous reviewers for their invaluable time and constructive comments on our paper. We would also like to thank Chen Gong, Tong Zhu, Shilin Zhou, and Yu Zhang for their help in polishing our paper.

This work was supported by the National Natural Science Foundation of China (Grant No. 62176173 and 62261160648), Alibaba Group through the Alibaba Innovative Research Program, and a Project Funded by the Priority Academic Program Development (PAPD) of Jiangsu Higher Education Institutions.

References

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. 2023. Qwen technical report. ArXiv preprint, abs/2309.16609.

Zuyi Bao, Chen Li, and Rui Wang. 2020. Chunk-based Chinese spelling check with global optimization. In Proceedings of EMNLP, pages 2031–2040, Online.

Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng. 2017. AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline. In 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, O-COCOSDA 2017, Seoul, South Korea, November 1-3, 2017, pages 1–5.

Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, Fukai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, Jia Yu, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, Yu Qiao, and Dahua Lin. 2024. InternLM2 technical report. ArXiv preprint, abs/2403.17297.

Xingyi Cheng, Weidi Xu, Kunlong Chen, Shaohua Jiang, Feng Wang, Taifeng Wang, Wei Chu, and Yuan Qi. 2020. SpellGCN: Incorporating phonological and visual similarities into language models for Chinese spelling check. In Proceedings of ACL, pages 871–881, Online.
Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. ArXiv preprint, abs/2307.08691.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, Minneapolis, Minnesota.

Ming Dong, Yujing Chen, Miao Zhang, Hao Sun, and Tingting He. 2024. Rich semantic knowledge enhanced large language models for few-shot Chinese spell checking. ArXiv preprint, abs/2403.08492.

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2022. GPTQ: Accurate post-training quantization for generative pre-trained transformers. ArXiv preprint, abs/2210.17323.

Saibo Geng, Martin Josifoski, Maxime Peyrard, and Robert West. 2023. Grammar-constrained decoding for structured NLP tasks without finetuning. In Proceedings of EMNLP, pages 10932–10952, Singapore.

Cheng-Yu Hsieh, Chun-Liang Li, Chih-kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, and Tomas Pfister. 2023. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes. In Findings of ACL, pages 8003–8017, Toronto, Canada.

Yong Hu, Fandong Meng, and Jie Zhou. 2024. CSCD-NS: A Chinese spelling check dataset for native speakers. In Proceedings of ACL, pages 146–159, Bangkok, Thailand.

Qiang Huang, Peijie Huang, Xinrui Zhang, Weijian Xie, Kaiduo Hong, Bingzhou Chen, and Lei Huang. 2014. Chinese spelling check system based on tri-gram model. In Proceedings of CIPS-SIGHAN, pages 173–178, Wuhan, China.

Wangjie Jiang, Zhihao Ye, Zijing Ou, Ruihui Zhao, Jianguang Zheng, Yi Liu, Bang Liu, Siheng Li, Yujiu Yang, and Yefeng Zheng. 2022. MCSCSet: A specialist-annotated dataset for medical-domain Chinese spelling correction. In Proceedings of CIKM, pages 4084–4088.

Yichong Leng, Xu Tan, Wenjie Liu, Kaitao Song, Rui Wang, Xiang-Yang Li, Tao Qin, Edward Lin, and Tie-Yan Liu. 2023. SoftCorrect: Error correction with soft detection for automatic speech recognition. In Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 13034–13042.

Yichong Leng, Xu Tan, Rui Wang, Linchen Zhu, Jin Xu, Wenjie Liu, Linquan Liu, Xiang-Yang Li, Tao Qin, Edward Lin, and Tie-Yan Liu. 2021b. FastCorrect 2: Fast error correction on multiple candidates for automatic speech recognition. In Proceedings of EMNLP, pages 4328–4337, Punta Cana, Dominican Republic.

Yichong Leng, Xu Tan, Linchen Zhu, Jin Xu, Renqian Luo, Linquan Liu, Tao Qin, Xiangyang Li, Edward Lin, and Tie-Yan Liu. 2021a. FastCorrect: Fast error correction with edit alignment for automatic speech recognition. In Advances in NeurIPS, pages 21708–21719.

Jiahao Li, Quan Wang, Zhendong Mao, Junbo Guo, Yanyan Yang, and Yongdong Zhang. 2022. Improving Chinese spelling check by character pronunciation prediction: The effects of adaptivity and granularity. In Proceedings of EMNLP, pages 4275–4286, Abu Dhabi, United Arab Emirates.

Kunting Li, Yong Hu, Liang He, Fandong Meng, and Jie Zhou. 2024. C-LLM: Learn to check Chinese spelling errors character by character. ArXiv preprint, abs/2406.16536.

Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. 2023b. Contrastive decoding: Open-ended text generation as optimization. In Proceedings of ACL, pages 12286–12312, Toronto, Canada.

Yinghui Li, Haojing Huang, Shirong Ma, Yong Jiang, Yangning Li, Feng Zhou, Hai-Tao Zheng, and Qingyu Zhou. 2023a. On the (in)effectiveness of large language models for Chinese text correction. ArXiv preprint, abs/2307.09007.

Zihong Liang, Xiaojun Quan, and Qifan Wang. 2023. Disentangled phonetic representation for Chinese spelling correction. In Proceedings of ACL, pages 13509–13521, Toronto, Canada.

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of MLSys.

Linfeng Liu, Hongqiu Wu, and Hai Zhao. 2024. Chinese spelling correction as rephrasing language model. In Proceedings of the AAAI, pages 18662–18670.

Shulin Liu, Shengkang Song, Tianchi Yue, Tao Yang, Huihui Cai, TingHao Yu, and Shengli Sun. 2022. CRASpell: A contextual typo robust approach to improve Chinese spelling correction. In Findings of ACL, pages 3008–3018, Dublin, Ireland.

Qi Lv, Ziqiang Cao, Lei Geng, Chunhui Ai, Xu Yan, and Guohong Fu. 2023. General and domain-adaptive Chinese spelling check with error-consistent pretraining. TALLIP, 22(5).
Xinyin Ma, Gongfan Fang, and Xinchao Wang. 2023. LLM-Pruner: On the structural pruning of large language models. In Advances in NeurIPS, volume 36, pages 21702–21720.

Sean O'Brien and Mike Lewis. 2023. Contrastive decoding improves reasoning in large language models. ArXiv preprint, abs/2309.09117.

Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen-tau Yih. 2023. Trusting your evidence: Hallucinate less with context-aware decoding. ArXiv preprint, abs/2305.14739.

Yuen-Hsien Tseng, Lung-Hao Lee, Li-Ping Chang, and Hsin-Hsi Chen. 2015. Introduction to SIGHAN 2015 bake-off for Chinese spelling check. In Proceedings of SIGHAN, pages 32–37, Beijing, China.

Bailin Wang, Zi Wang, Xuezhi Wang, Yuan Cao, Rif A. Saurous, and Yoon Kim. 2023. Grammar prompting for domain-specific language generation with large language models. ArXiv preprint, abs/2305.19234.

Dingmin Wang, Yan Song, Jing Li, Jialong Han, and Haisong Zhang. 2018. A hybrid approach to automatic corpus generation for Chinese spelling check. In Proceedings of EMNLP, pages 2517–2527, Brussels, Belgium.

Hongqiu Wu, Shaohua Zhang, Yuchen Zhang, and Hai Zhao. 2023. Rethinking masked language modeling for Chinese spelling correction. In Proceedings of ACL, pages 10743–10756, Toronto, Canada.

Shih-Hung Wu, Chao-Lin Liu, and Lung-Hao Lee. 2013. Chinese spelling check evaluation at SIGHAN bake-off 2013. In Proceedings of SIGHAN, pages 35–42, Nagoya, Japan.

Weijian Xie, Peijie Huang, Xinrui Zhang, Kaiduo Hong, Qiang Huang, Bingzhou Chen, and Lei Huang. 2015. Chinese spelling check system based on n-gram model. In Proceedings of SIGHAN, pages 128–136, Beijing, China.

Heng-Da Xu, Zhongli Li, Qingyu Zhou, Chao Li, Zizhen Wang, Yunbo Cao, Heyan Huang, and Xian-Ling Mao. 2021. Read, listen, and see: Leveraging multimodal information helps Chinese spell checking. In Proceedings of ACL-IJCNLP, pages 716–728, Online.

Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, Fan Yang, Fei Deng, Feng Wang, Feng Liu, Guangwei Ai, Guosheng Dong, Haizhou Zhao, Hang Xu, Haoze Sun, Hongda Zhang, Hui Liu, Jiaming Ji, Jian Xie, Juntao Dai, Kun Fang, Lei Su, Liang Song, Lifeng Liu, Liyun Ru, Luyao Ma, Mang Wang, Mickel Liu, MingAn Lin, Nuolan Nie, Peidong Guo, Ruiyang Sun, Tao Zhang, Tianpeng Li, Tianyu Li, Wei Cheng, Weipeng Chen, Xiangrong Zeng, Xiaochuan Wang, Xiaoxi Chen, Xin Men, Xin Yu, Xuehai Pan, Yanjun Shen, Yiding Wang, Yiyu Li, Youxin Jiang, Yuchen Gao, Yupeng Zhang, Zenan Zhou, and Zhiying Wu. 2023a. Baichuan 2: Open large-scale language models. ArXiv preprint, abs/2309.10305.

Liner Yang, Xin Liu, Tianxin Liao, Zhenghao Liu, Mengyan Wang, Xuezhi Fang, and Erhong Yang. 2023b. Is Chinese spelling check ready? Understanding the correction behavior in real-world scenarios. AI Open, 4:183–192.

Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. 2024. Parallelizing linear transformers with the delta rule over sequence length. ArXiv preprint, abs/2406.06484.

Jui-Feng Yeh, Sheng-Feng Li, Mei-Rong Wu, Wen-Yi Chen, and Mao-Chuan Su. 2013. Chinese word spelling correction based on n-gram ranked inverted index list. In Proceedings of SIGHAN, pages 43–48, Nagoya, Japan.

Junjie Yu and Zhenghua Li. 2014. Chinese spelling error detection and correction based on language model, pronunciation, and shape. In Proceedings of CIPS-SIGHAN, pages 220–223, Wuhan, China.

Liang-Chih Yu, Lung-Hao Lee, Yuen-Hsien Tseng, and Hsin-Hsi Chen. 2014. Overview of SIGHAN 2014 bake-off for Chinese spelling check. In Proceedings of CIPS-SIGHAN, pages 126–132, Wuhan, China.

Shaohua Zhang, Haoran Huang, Jicong Liu, and Hang Li. 2020. Spelling error correction with soft-masked BERT. In Proceedings of ACL, pages 882–890, Online.

Chenxi Zhu, Ziqiang Ying, Boyu Zhang, and Feng Mao. 2022. MDCSpell: A multi-task detector-corrector framework for Chinese spelling correction. In Findings of ACL, pages 1244–1253, Dublin, Ireland.

Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, Jingqi Tong, Conghui He, and Yu Cheng. 2024. LLaMA-MoE: Building mixture-of-experts from LLaMA with continual pre-training. ArXiv preprint, abs/2406.16554.
A Special Acknowledgements

We would like to extend our special thanks to all anonymous reviewers for their valuable comments and suggestions.

Reviewer gUCq highlighted unclear descriptions and missing experiments, such as the comparison with simpler LMs, in our initial version of the paper. The revisions made in response to these suggestions have significantly improved the quality of our work.

Reviewer B44m provided strong positive feedback on our work while also identifying missing details in the experimental setup, the absence of results on few-shot settings with GPT-4, and other aspects. Addressing these points made our paper more comprehensive and rigorous.

Reviewer cWnK raised concerns about the flexibility of introducing new knowledge. This insightful comment motivated us to further explore the topic and provide a simple solution in §5.5.

Reviewers cnvj, wY4T, and QXDJ gave our work high evaluations and provided numerous constructive comments and suggestions. Their recognition encourages us to continue refining our paper.

Corrected → Input (similar consonants)
j  → q, x, z
q  → j, x, c
x  → j, q, s
z  → j, c, s, zh
c  → q, z, s, ch
s  → z, c, sh
zh → z, ch, sh
ch → c, zh, sh
sh → s, zh, ch
r  → l
l  → r, n, d, t
n  → l, d, t
d  → l, n, t, b
t  → l, n, d, p
b  → d, p, m
p  → t, b
m  → b, p
g  → k, h
k  → g, h
h  → g, k, f
f  → h

Table 9: Consonants with similar pronunciation.
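The similar-initial pairs recoverable from Table 9 can be stored as a simple lookup, sketched below; the paper's rule-based tool additionally covers finals and character shapes, which are not reproduced here.

```python
# A subset of the similar-consonant pairs from Table 9.
SIMILAR_INITIALS = {
    "zh": {"z", "ch", "sh"}, "ch": {"c", "zh", "sh"}, "sh": {"s", "zh", "ch"},
    "z": {"j", "c", "s", "zh"}, "c": {"q", "z", "s", "ch"}, "s": {"z", "c", "sh"},
    "l": {"r", "n", "d", "t"}, "n": {"l", "d", "t"},
    "g": {"k", "h"}, "k": {"g", "h"}, "h": {"g", "k", "f"}, "f": {"h"},
}

def similar_initial(a: str, b: str) -> bool:
    """True if two pinyin initials are identical or listed as similar."""
    return a == b or b in SIMILAR_INITIALS.get(a, set()) or a in SIMILAR_INITIALS.get(b, set())

print(similar_initial("zh", "z"))  # True
print(similar_initial("b", "g"))   # False
```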
Table 11: The statistics of the datasets used in the experiments. Recall Upper Bound represents the sentence-level upper bound of the recall under the distortion model that we use in this work.

(2023b), which has manually verified and corrected the errors in the original dataset.

• CSCD-NS: A real-world Chinese social media corpus collected and annotated by Hu et al. (2024). It can better represent the variety of texts found in real-world settings and includes a broad spectrum of errors.

• MCSCSet: A large-scale corpus from the medical domain, collected and annotated by Jiang et al. (2022). It features numerous errors specific to medical terminology, making it an excellent resource for evaluating models' generalization capabilities in this area.

• ECSpell: A small-scale, multi-domain corpus annotated by Lv et al. (2023). It encompasses three domains: legal documents, medical treatments, and official document writing.

• Lemon: The most recent and largest multi-domain corpus to date, collected and annotated by Wu et al. (2023). It spans seven domains: medicine, encyclopedia, gaming, automotive, contracts, news, and novels. The original dataset also includes Sighan 15 as a subset, which we have considered as a part of the Sighan series and excluded from Lemon.

The detailed statistics of these datasets are shown in Table 11.

The recall upper bound in the statistics is obtained by calculating the number of sentences that can potentially be fully corrected out of the total number of sentences in the dataset. A sentence has the potential to be fully corrected if all the distortion types between each pair of source and target characters can be categorized into Identical, Same Pinyin, Similar Pinyin, and Similar Shape.

C.2 Implementation Details of Prompt-based Method

In this work, we use the prompt-based method to activate the CSC ability of the baseline LLMs. The task-specific instructions are adopted from Li et al. (2023a). The prompts used for the baselines are shown in Figure 4. We disable the sampling mechanism and set the temperature to 0.0 to ensure deterministic decoding. For few-shot prompting methods, where the example selection strategy involves random selection, we conduct three runs and report the average results. The only exception is the GPT4 model, which we run only once due to the high cost of using the model.

Figure 4: System and user prompts for the baselines.
System Prompt: 你是一个优秀的中文拼写纠错模型,中文拼写纠错模型即更正用户输入句子中的拼写错误。 (You are an excellent Chinese spelling correction model; a Chinese spelling correction model corrects the spelling errors in the user's input sentence.)
User Prompt: 你需要识别并纠正用户输入的句子中可能的错别字并输出正确的句子,纠正时必须保证改动前后句子必须等长,在纠正错别字的同时尽可能减少对原句子的改动(不添加额外标点符号,不添加额外的字,不删除多余的字)。只输出没有错别字的句子,不要添加任何其他解释或说明。如果句子没有错别字,就直接输出和输入相同的句子。 (You need to identify and correct possible typos in the user's input sentence and output the correct sentence. The corrected sentence must be exactly as long as the input, and the changes to the original sentence should be minimized (do not add extra punctuation, do not add extra characters, and do not delete characters). Output only the sentence without typos, without any additional explanation. If the sentence contains no typos, output the sentence exactly as given.)
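A sketch of how the ZSP baseline can call a GPT model with the Figure 4 prompts and deterministic decoding. How the input sentence is appended to the user prompt is an assumption, and the OpenAI client is shown only because the GPT baselines are closed models.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = "你是一个优秀的中文拼写纠错模型,中文拼写纠错模型即更正用户输入句子中的拼写错误。"
USER_TEMPLATE = (
    "你需要识别并纠正用户输入的句子中可能的错别字并输出正确的句子,"
    "纠正时必须保证改动前后句子必须等长,在纠正错别字的同时尽可能减少对原句子的改动"
    "(不添加额外标点符号,不添加额外的字,不删除多余的字)。"
    "只输出没有错别字的句子,不要添加任何其他解释或说明。"
    "如果句子没有错别字,就直接输出和输入相同的句子。\n{sentence}"
)

def zsp_correct(sentence: str, model: str = "gpt-3.5-turbo-0125") -> str:
    """Zero-shot prompting baseline with temperature 0.0 (deterministic decoding)."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0.0,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(sentence=sentence)},
        ],
    )
    return resp.choices[0].message.content.strip()
```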
Model            Version   Strategy
Baichuan2 13B    Base      Similarity
Qwen1.5 14B      Base      Balanced
InternLM2 20B    Chat      Similarity
GPT3.5           –         Balanced
GPT4             –         Balanced

Table 12: The model version and example selection strategy used for the few-shot baselines.

C.3 Few-shot Example Selection Strategy for Baselines

Li et al. (2023a) proposed three selection strategies for CSC few-shot prompting methods: 1) Random: randomly select m examples; 2) Balanced: randomly select m examples with a balanced distribution of correct and erroneous examples; 3) Similarity: select the m most similar in-context examples for each input sentence using the BM25 and Rouge similarity metrics.

They found that the performance of few-shot prompting depends on the selection of in-context examples. Different selection strategies may lead to distinct results. Among the three strategies, Similarity was found to be the most effective.

However, the Similarity strategy is not always the optimal choice. In preliminary experiments, we observed that this strategy sometimes causes GPT family models to perform worse than the zero-shot prompting method. Upon analyzing the results, we found that GPT models are particularly sensitive to discrepancies in the proportion of erroneous sentences between the few-shot prompting examples and the target data. The examples selected using the Similarity strategy tend to have a similar proportion of erroneous sentences as the dataset used for selection. In our work, we use the Pseudo-Dev dataset to select few-shot prompting examples, which contains a higher proportion of erroneous sentences (87%–94%) compared to the target data (50%–56%). This discrepancy causes the GPT models to be more aggressive in correcting errors.

To ensure the effectiveness of the few-shot prompting method, we conducted experiments to determine the optimal strategy for each LLM we used. For open-source LLMs, which include both 'Base' and 'Chat' versions, we experimented with both versions and selected the best one for each LLM. The final choice of selection strategy is shown in Table 12.
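A sketch of the Balanced strategy, under the assumption that Pseudo-Dev is available as a list of (source, target) pairs; the Random and Similarity strategies differ only in how the m examples are picked.

```python
import random

def select_balanced(pseudo_dev, m=10, seed=0):
    """'Balanced' few-shot selection: sample m in-context examples with an even
    split of erroneous and already-correct sentence pairs."""
    rng = random.Random(seed)
    wrong = [pair for pair in pseudo_dev if pair[0] != pair[1]]
    right = [pair for pair in pseudo_dev if pair[0] == pair[1]]
    picked = rng.sample(wrong, m // 2) + rng.sample(right, m - m // 2)
    rng.shuffle(picked)
    return picked
```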
C.4 Pre- & Post-processing for Baselines

In this study, we employ several pre- and post-processing techniques to mitigate the errors introduced by the limitations of the baseline systems. This ensures a fair comparison between our approach and the baselines.
Datasets                  | rSighans           | ECSpell            | Lemon
Subsets                   | Y13   Y14   Y15    | Law   Med   Odw    | Car   Cot   Enc   Gam   Med   New   Nov
Domain-Specific SOTAs (trained on in-domain gold-standard data of each dataset)
ReaLiSe                     70.1  64.0  73.9     38.9  23.1  42.8     32.5  40.1  29.1  12.6  31.8  31.2  20.2
Liu et al. (2024)           –     –     –        91.2  82.4  83.6     –     –     –     –     –     –     –
Domain-General SOTAs (trained on about 34M synthetic CSC data)
Finetuned BERT              50.6  40.4  51.6     58.5  47.8  65.1     52.0  63.1  45.3  32.8  50.7  56.1  35.8
Softmasked BERT             51.6  40.2  51.3     58.5  48.5  65.9     52.3  63.8  44.1  28.3  48.9  55.6  37.7
ReLM                        45.8  40.6  55.5     60.4  50.9  66.5     53.3  66.7  47.7  33.7  53.8  58.8  37.1
LLMs (without CSC-specific training)
Baichuan2 (13B)  ZSP        26.4  12.0  18.5     37.6  23.0  43.0     15.3  14.9  24.0  12.7  21.6  19.8  14.1
Baichuan2 (13B)  FSP        41.1  23.1  31.3     60.2  50.4  60.0     32.2  45.3  38.9  24.6  39.0  39.7  26.4
Baichuan2 (13B)  OUR        63.6  54.1  59.6     82.6  78.9  92.0     52.7  62.9  51.9  37.1  60.1  63.9  43.5
Qwen1.5 (14B)    ZSP        41.6  17.4  28.1     53.3  38.9  60.7     28.5  42.0  33.8  20.5  35.3  37.3  25.3
Qwen1.5 (14B)    FSP        45.9  25.4  31.6     61.4  49.1  66.5     35.0  47.6  43.4  27.9  38.6  38.7  29.2
Qwen1.5 (14B)    OUR        56.9  48.6  57.6     84.1  73.2  87.4     46.0  59.9  44.6  28.3  52.9  55.8  36.4
InternLM2 (20B)  ZSP        42.3  20.9  29.7     47.7  31.9  55.9     29.8  42.6  34.3  21.2  40.0  34.7  27.2
InternLM2 (20B)  FSP        55.9  27.7  32.9     45.9  38.2  65.3     31.3  46.7  37.1  25.4  43.4  37.9  29.3
InternLM2 (20B)  OUR        57.8  53.1  60.5     83.9  72.3  91.1     49.7  59.0  48.2  31.8  55.9  63.3  40.5

Table 13: Detailed sentence-level correction F1 (S-F) results of each sub-domain.
BERT-based baselines  Most current CSC models utilize BERT as the backbone. However, BERT presents challenges that can degrade performance during evaluation: 1) Full-width Punctuation: BERT's tokenization process may normalize full-width punctuation to half-width, leading to numerous unnecessary punctuation replacements. To counter this, we prevent the model from modifying the original punctuation; 2) Special Tokens: BERT-based models may predict a special '[UNK]' token in some cases, resulting in the removal of the original character. In these instances, we retain the original character when a special token is predicted; 3) Input Length Limitation: BERT-based models show limited generalization beyond their maximum training length. We truncate inputs to a maximum length of 128 characters and concatenate the remaining characters to the output.

LLM baselines  The outputs of LLMs sometimes fail to align with the evaluation, primarily due to their inadequate instruction-following capability. To address this, we apply specific rules for post-processing: 1) Redundant Phrases: We remove redundant phrases such as "修改后的句子是:" (The corrected sentence is:), identified through common patterns in the model output; 2) Redundant Punctuation: Many sentences in the dataset lack terminal periods, yet some models inappropriately add them. To prevent incorrect evaluations due to this discrepancy, we remove any added terminal period if the original sentence did not have one.
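A sketch of the LLM-output post-processing rules above; the exact set of redundant-phrase patterns used in the paper is not spelled out, so the two rules below are only illustrative.

```python
def postprocess_llm_output(source: str, output: str) -> str:
    """Strip a leading '修改后的句子是:'-style phrase and drop a terminal period
    that the model added when the source sentence had none."""
    for phrase in ("修改后的句子是:", "修改后的句子是:"):  # full- and half-width colon
        if output.startswith(phrase):
            output = output[len(phrase):].strip()
    if output.endswith("。") and not source.endswith("。"):
        output = output[:-1]
    return output

print(postprocess_llm_output("商务部前头,11月底完成",
                             "修改后的句子是:商务部牵头,11月底完成。"))
```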
Datasets                  | rSighans           | ECSpell            | Lemon
Subsets                   | Y13   Y14   Y15    | Law   Med   Odw    | Car   Cot   Enc   Gam   Med   New   Nov
Domain-Specific SOTAs (trained on in-domain gold-standard data of each dataset)
ReaLiSe                     13.0   9.6   7.7     10.6  18.6  11.8     20.9  13.4  20.8  22.5  16.5  16.7  22.6
Liu et al. (2024)           –     –     –         7.4   6.5   2.2     –     –     –     –     –     –     –
Domain-General SOTAs (trained on about 34M synthetic CSC data)
Finetuned BERT              21.7  16.5  12.5      4.9  11.3   2.9     12.3   8.3  13.9  22.5   8.3   9.4  17.3
Softmasked BERT             13.0  17.6  14.5      6.1  11.7   5.0     12.4   7.1  14.8  20.4   9.6  10.6  16.6
ReLM                         4.4  15.0   9.5      7.8  11.0   7.1     12.1   5.6  12.6  20.8   5.7   8.4  17.5
LLMs (without CSC-specific training)
Baichuan2 (13B)  ZSP        34.8  58.3  54.4     26.9  43.1  21.0     40.6  54.2  35.9  41.6  35.4  41.1  37.6
Baichuan2 (13B)  FSP        21.7  19.4  23.2      7.8   9.1   0.4      8.3   7.4  10.2  20.0   4.6   8.3   7.7
Baichuan2 (13B)  OUR         8.7  14.1   8.3      4.5   9.9   0.4      5.9   6.9   8.9  19.2   3.9   5.7  13.0
Qwen1.5 (14B)    ZSP        34.8  54.4  34.2      5.7  35.4   2.1     18.5  15.8  13.5  18.4  11.8  14.0  20.7
Qwen1.5 (14B)    FSP        15.9  30.9  31.7      5.3  11.6   0.8      8.9  12.7  10.1  14.7   9.5   7.8   5.5
Qwen1.5 (14B)    OUR        21.7  19.6  10.2      4.9  11.7   2.9     11.2   6.3  14.8  29.4   5.4  10.1  21.2
InternLM2 (20B)  ZSP        65.2  58.0  48.8     26.5  50.7  17.7     28.8  23.7  30.0  30.6  23.0  34.0  24.2
InternLM2 (20B)  FSP        21.7  39.8  33.6     13.9  30.7   2.5     18.2  12.3  18.1  23.7  10.0  22.4  16.1
InternLM2 (20B)  OUR        13.0  16.5   8.3      2.5  12.4   0.4      8.5   6.9  12.2  22.5   3.7   6.1  15.1

Table 14: Detailed sentence-level false positive rate (FPR) results of each sub-domain.
D Details of Evaluation

D.1 Evaluation Metrics

In this work, we use the following metrics to evaluate the performance of our approach and the baselines.

Sentence-level Correction F1 (S-F)  S-F consists of two parts: precision (S-P) and recall (S-R):

\[
\text{S-F} = 2 \times \frac{\text{S-P} \times \text{S-R}}{\text{S-P} + \text{S-R}}, \tag{10}
\]

where S-P represents the proportion of correctly corrected sentences among all sentences modified by the model, and S-R represents the proportion of correctly corrected sentences among all sentences that need to be corrected.

A sentence is considered correctly corrected if and only if all errors in the sentence are fixed and no new errors are introduced. This strict definition makes the sentence-level F1 score rigorous, but it also makes it vulnerable when the number of evaluation samples is limited, as in the ECSpell dataset, which contains only 500 sentences for each sub-domain, and it lacks the ability to detect subtle differences between models when evaluating on a dataset where a sentence may contain multiple errors.

Character-level Correction F1 (C-F)  Different from the sentence-level F1 score, the character-level F1 score focuses on the correctness of each character in the sentence. Similar to the sentence-level F1 score, the character-level F1 score also consists of two parts: precision (C-P) and recall (C-R). C-P is the proportion of correctly corrected characters among all characters modified by the model, and C-R is the proportion of correctly corrected characters among all characters that need to be corrected.

Conventional character-level metrics of CSC are based on point-wise evaluation, which falls short when models insert or delete characters, as it can inaccurately mark all subsequent characters as incorrect due to a single addition or deletion. To overcome this, we implement the Levenshtein algorithm to align the model output with the target sentence and calculate the character-level metrics based on the aligned results. This alignment-based method provides a more reasonable evaluation of character-level performance.
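The sentence-level definitions above translate directly into code; the sketch below computes S-P, S-R, S-F, and FPR from (source, prediction, target) triples.

```python
def sentence_level_metrics(sources, predictions, targets):
    """Sentence-level precision, recall, F1, and FPR as defined in Appendix D.1:
    a sentence counts as correctly corrected only if the prediction equals the
    target; FPR is the share of originally correct sentences that were modified."""
    modified = corrected = need = hit = correct_src = false_pos = 0
    for s, p, t in zip(sources, predictions, targets):
        if s != t:                      # sentence needs correction
            need += 1
            if p == t:
                hit += 1
        else:                           # sentence is already correct
            correct_src += 1
            if p != s:
                false_pos += 1
        if p != s:                      # sentence was modified by the model
            modified += 1
            if p == t:
                corrected += 1
    s_p = corrected / modified if modified else 0.0
    s_r = hit / need if need else 0.0
    s_f = 2 * s_p * s_r / (s_p + s_r) if s_p + s_r else 0.0
    fpr = false_pos / correct_src if correct_src else 0.0
    return {"S-P": s_p, "S-R": s_r, "S-F": s_f, "FPR": fpr}
```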
Input          商务部前头,11月底完成
Reference      商务部牵头,11月底完成
ReLM           商务部牵头,11月底完成
BC2 13B ZSP    商务部前面,11月底完成
BC2 13B FSP    商务部日前,11月底完成
BC2 13B OUR    商务部牵头,11月底完成

Input          虎珀酸索莉那新片主要功能是什么
Reference      琥珀酸索利那新片主要功能是什么
ReLM           琥珀酸索莉那新片主要功能是什么
BC2 13B ZSP    琥珀酸索利那新片主要功能是什么
BC2 13B FSP    虎珀酸索莉那新片主要功能是什么
BC2 13B OUR    琥珀酸索利那新片主要功能是什么

Table 16: Qualitative examples of our approach and the baselines. Corrections marked in blue are correct, while those in red are incorrect.

Sentence-level False Positive Rate (FPR)  Both the sentence-level F1 score and the character-level F1 score overlook the cases where the model introduces unnecessary modifications to a de-facto correct sentence. To fill this gap, the sentence-level False Positive Rate (FPR) is proposed to measure the proportion of sentences that are initially correct but modified by the model.

D.2 Evaluation Settings and Conventions

During evaluation, we remove all whitespaces and convert all full-width punctuation to half-width in the input and output sentences to guarantee a fair comparison.¹¹

When evaluating the Lemon dataset, we ignore all sentences where the input and output sentence lengths do not match, following the dataset's convention.

E More Results

E.1 Detailed Results

Due to the space limitation, we only present the average results of each dataset in the main text. The detailed results of each dataset are shown in Table 13, Table 14, and Table 15.

E.2 Qualitative Examples

We provide two qualitative examples to illustrate the performance of our approach in Table 16.

In the first case ("Led by the Ministry of Commerce, to be completed by the end of November"), the word "牵头" (qiāntóu, led by) is misspelled as "前头" (qiántóu, front) in the input sentence. Both the ZSP and FSP baselines mistakenly put their attention on the character "前" (front) and incorrectly correct "前头" to "前面" (front) and "日前" (a few days ago), respectively. Such corrections are not only implausible but also linguistically awkward. In contrast, the domain-general model ReLM and our approach successfully correct the misspelling.

In the second case ("What are the main functions of Solifenacin Succinate Tablets"), the name of the drug "琥珀酸索利那新片" (Solifenacin Succinate Tablets) is misspelled. To correct the misspelling, knowledge of the medical domain is required. In this case, the ReLM model fails to correct the misspelling, while the zero-shot prompting baseline and our approach successfully correct it. It is worth noting that the few-shot prompting baseline also fails to correct the misspelling, which indicates that the inclusion of inappropriate examples may lead to worse performance.

F More Discussions

F.1 Impact of the Pre-training Data

There are two main factors that differentiate LLMs from simpler LMs: the scale of pre-training data and the model size. The impact of model size on the performance of LLMs has been discussed in §5.1. In this subsection, we aim to investigate the impact of pre-training data on the performance of our approach.

We compare Qwen1.5, a recent LLM family, with GPT2, which also has a causal LM (decoder-only) architecture. The GPT2 model family partially overlaps in model size with the Qwen1.5 model family, but it was trained on a much smaller dataset, CLUECorpusSmall. The CLUECorpusSmall dataset contains only about 5 billion characters and has limited diversity in text sources, including only news, Wikipedia, forums, and comments.

As shown in Table 17, when the model sizes are similar, the Qwen1.5 model family outperforms the GPT2 model family on all three datasets. The largest performance gap is observed on the Lemon-Nov dataset, where a smaller 463M Qwen1.5 model even outperforms a larger 1.5B GPT2 model by 7.1% in the sentence-level correction F1 score. This is because the Lemon-Nov dataset contains texts from the novel domain, which is not included in the CLUECorpusSmall dataset. These results indicate that the scale and diversity of the pre-training data are crucial for the performance of our approach.

¹¹ BERT-based models often remove whitespaces during tokenization and may convert full-width punctuation to half-width when correcting spelling errors (e.g., ReLM).
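The evaluation conventions of Appendix D.2 can be approximated as below; NFKC normalization is used here as a stand-in for the full-width-to-half-width conversion and does not cover every punctuation mark (e.g., the ideographic full stop).

```python
import unicodedata

def normalize_for_eval(text: str) -> str:
    """Drop all whitespace and map full-width characters to half-width forms,
    approximating the evaluation conventions of Appendix D.2."""
    text = "".join(ch for ch in text if not ch.isspace())
    return unicodedata.normalize("NFKC", text)

print(normalize_for_eval("商务部牵头 ,11月底完成"))  # -> 商务部牵头,11月底完成
```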
System         Data Amount | rSighan 15         | Lemon Nov          | ECSpell Odw
                           | S-F↑  C-F↑  FPR↓   | S-F↑  C-F↑  FPR↓   | S-F↑  C-F↑  FPR↓
GPT2 1.5B      Small         56.6  64.4  10.4     26.1  31.8  31.4     82.8  85.8   5.5
Qwen1.5 463M   Large         56.3  63.5  10.0     33.2  40.2  22.2     84.7  89.9   3.8
Qwen1.5 1.8B   Large         58.3  65.3  10.3     35.6  42.3  19.9     90.3  92.8   1.7

Table 17: A brief comparison of the performance of LLMs of different sizes and pre-training data amounts on three datasets.

F.2 Comparison to the Supervised Fine-tuning Method

In this subsection, we compare our approach with the supervised fine-tuning method.

However, we did not fine-tune the LLMs ourselves, as fine-tuning an LLM on the 34M synthetic CSC data would be extremely time-consuming and computationally expensive. Additionally, the supervised fine-tuning method typically requires careful hyperparameter tuning to achieve the best performance, further increasing the computational cost. Instead, we leverage the findings from Li et al. (2023a), who fine-tuned the Baichuan2 7B and GPT2 models on the ECSpell dataset, and Hu et al. (2024), who fine-tuned the Baichuan2 7B model on the CSCD-NS dataset.

The results are shown in Table 18. Compared to the BERT-based models, the supervised fine-tuning method is less effective in improving the performance of causal LMs like GPT2 and recent LLMs such as Baichuan2.

Our training-free approach even outperforms the supervised fine-tuning counterpart on the Med and Odw sub-domains of the ECSpell dataset. This phenomenon can be attributed to the characteristics of the ECSpell dataset, which, as pointed out by Wu et al. (2023), contains a high proportion (more than 70%) of error-correction pairs that never appeared in the training data. The supervised fine-tuning method is not effective in handling these unseen error-correction pairs, whereas our approach can still correct them.

F.3 Influence of Beam Size

When searching for the most likely correction sequence, the beam search algorithm is used to avoid the exponential growth of the search space and the local minima caused by greedy search. Knowing the impact of the beam size on the performance helps researchers choose a proper beam size to balance the trade-off between the performance and the computational cost. The results are shown in Figure 5. Though a larger beam size consistently leads to better performance, the improvement becomes marginal when the beam size is larger than 6.
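A schematic of the beam search loop whose width is varied in this analysis; the candidate generator and the step scorer are stubs standing in for the confusion sets and the p_LLM/p_DM scores, and the length and faithfulness rewards of §5.3 are omitted for brevity.

```python
def beam_correct(x, candidates, step_score, beam_size=8):
    """Beam search over character positions (the main experiments use beam size 8).
    `candidates(ch)` yields possible replacements of an input character and
    `step_score(prefix, ch, x_ch)` returns an incremental log-probability."""
    beams = [("", 0.0)]
    for x_ch in x:
        expanded = [
            (prefix + ch, score + step_score(prefix, ch, x_ch))
            for prefix, score in beams
            for ch in candidates(x_ch)
        ]
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]
```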
[Figure 5: line plots of S-F and C-F against beam sizes from 2 to 12 on rSighan 15, Lemon Nov, and ECSpell Odw.]

Figure 5: The scores of Baichuan2 7B with different beam sizes. The solid lines represent the results of our approach, and the dashed lines represent the results of the few-shot baseline. We can observe that larger beam sizes may lead to worse C-F scores in few-shot settings.
F.4 Effectiveness of the Estimated Distortion Model

In our approach, we directly estimate the distortion probabilities from the statistics of the Pseudo-Dev dataset (§3.5). Obviously, this estimation will be different from the true probabilities.

To verify the effectiveness of the estimated distortion model, we conduct experiments comparing the estimated distortion model with the true distortion model. The results are presented in Table 19. The upper part of the table shows the difference between the estimated distortion model and the true distortion model. We can see that the estimated one is quite close to the true one, except for the Similar Shape distortion type. The lower part shows that the difference in performance is marginal, indicating that the estimated distortion model is sufficient for our approach to achieve good performance and has good generalization ability across different datasets.

F.5 Inference Speed

We conducted a brief runtime analysis to evaluate the inference speed of our approach. The analysis was performed using a single NVIDIA A100 40GB GPU with an Intel Xeon Gold 6248R (3.00GHz) CPU. The batch size was set to 1 for all models, and the other hyperparameters were set to the same values as in the main experiments.

The average inference speed of each model on the ECSpell-Odw dataset is shown in Table 20. Due to the large model size and the autoregressive decoding process, LLMs are significantly slower than the BERT-based ReLM model. Compared to the ZSP and FSP baselines, our approach is slower (1.71× and 1.45×, respectively), primarily due to our immature implementation of the distortion model, which can be further optimized to improve inference speed.

System               Inference Speed (ms)
                     per Sent.   per Char.
ReLM                      14.4         0.4
Baichuan2 13B  ZSP       899.8        22.2
Baichuan2 13B  FSP     1,057.4        26.1
Baichuan2 13B  OUR     1,541.0        38.0

Table 20: The inference speed of different models.

System                 Ctx.   S-F↑  C-F↑  FPR↓  CER↓  CERR↑
No correction           –      –     –     –    4.83    –
Domain-Specific SOTAs
Leng et al. (2021a)     ✗      –     –     –    4.16  13.9
Leng et al. (2021b)     ✗      –     –     –    4.11  14.9
Leng et al. (2023)      ✗      –     –     –    3.57  26.1
Domain-General SOTAs
Finetuned BERT          ✗     23.7  25.7   5.3  4.39   9.1
                        ✓     18.2  19.5   1.8  4.43   8.3
Softmasked BERT         ✗     22.6  25.5   5.4  4.43   8.3
                        ✓     19.8  21.4   1.9  4.39   9.1
ReLM                    ✗     24.7  27.5   4.7  4.30  11.0
                        ✓     17.7  18.6   2.5  4.50   6.8
LLMs
Baichuan2 (13B) OUR     ✗     34.8  43.1   3.8  3.68  23.8
                        ✓     41.7  51.3   3.0  3.29  31.9
Qwen1.5 (14B) OUR       ✗     28.7  37.4   7.1  4.10  15.1
                        ✓     38.0  48.5   4.4  3.44  28.8
InternLM2 (20B) OUR     ✗     33.8  42.6   4.1  3.70  23.4
                        ✓     40.4  51.3   3.0  3.29  31.9

Table 21: Results of contextually enhanced spelling correction on the AISHELL-1 dataset. ✓ denotes the results of models taking 3 preceding sentences as the input prefix. All the preceding context is also predicted by the same ASR model.

F.6 Context as New Knowledge

In Section 5.5, we used a toy example to demonstrate that our approach can introduce new knowledge into the LLM by merely modifying the input prefix. However, in real-world scenarios, it is difficult to automatically extract the key characters as we did in the toy example and ensure they are suitable for the input prefix. Luckily, sentences in real-world contexts are not isolated but are part
of a paragraph, and their preceding sentences can
provide valuable information for error correction.
Thus, we can treat the preceding context as new
knowledge and introduce it into the LLM.
Since existing datasets for CSC are composed of isolated sentences, it is impossible to validate the effectiveness of using the preceding context as new knowledge on them. Therefore, we utilize the ASR error correction dataset derived from AISHELL-1 (Bu et al., 2017), where the sentences are consecutive and part of coherent passages. In this dataset, Leng et al. (2021a) used an ASR model to transcribe the speech data, introducing spelling errors naturally caused by the ASR system.

In addition to the conventional CSC metrics, we also report the Character Error Rate (CER)¹² and the Character Error Rate Reduction (CERR)¹³ to compare with the baseline models (Leng et al., 2021a,b, 2023).

Specifically, we take the three preceding sentences from the source side as the new knowledge:
¹² CER calculates the number of insertion, deletion, and substitution edits required to transform the predicted sequence into the target sequence, normalized by the length of the target sequence.
¹³ CERR represents the percentage of CER reduction compared to the baseline model:

\[
\mathrm{CERR} = 1 - \frac{\mathrm{CER}_{\mathrm{ours}}}{\mathrm{CER}_{\mathrm{baseline}}}.
\]
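For reference, the two footnote metrics can be computed as below; normalizing the edit distance by the target length follows the standard CER convention.

```python
def char_error_rate(prediction: str, target: str) -> float:
    """CER (footnote 12): minimum number of insertion, deletion, and substitution
    edits between prediction and target, divided by the target length."""
    m, n = len(prediction), len(target)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1,
                        prev + (prediction[i - 1] != target[j - 1]))
            prev = cur
    return dp[n] / max(n, 1)

def cerr(cer_ours: float, cer_baseline: float) -> float:
    """CERR (footnote 13): relative CER reduction, 1 - CER_ours / CER_baseline."""
    return 1.0 - cer_ours / cer_baseline
```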