CKnowEdit: A New Chinese Knowledge Editing Dataset For
tive elements such as ancient poetry, proverbs, idioms, and other cultural constructs. However, current Large Language Models (LLMs) face limitations in these specialized domains, highlighting the need for comprehensive datasets that can assess, continuously update, and progressively improve these culturally grounded linguistic competencies through targeted training optimizations. To address this gap, we introduce CKnowEdit, the first-ever Chinese knowledge editing dataset designed to correct linguistic, factual, and logical errors in LLMs. We collect seven types of knowledge from a wide range of sources, including classical texts, idioms, and content from Baidu Tieba Ruozhiba, taking into account the unique polyphony, antithesis, and logical structures inherent in the Chinese language. By analyzing this dataset, we highlight the challenges current LLMs face in mastering Chinese. Furthermore, our evaluation of state-of-the-art knowledge editing techniques reveals opportunities to advance the correction of Chinese knowledge.¹

* Equal contribution and shared co-first authorship.
† Corresponding Author.
¹ Code and dataset are available at https://github.com/zjunlp/EasyEdit.

1 Introduction

The reliance on static training data and the lack of explicit knowledge representation in Large Language Models often lead to issues such as hallucinations, bias, and offensive outputs (Zhao et al., 2023; Huang et al., 2023a; Liu et al., 2023; Sun et al., 2024b). These limitations become particularly pronounced when LLMs operate in complex domains or languages, such as Chinese. As shown in Figure 1, Chinese is a highly complex and linguistically unique system and presents three distinct challenges: (i) Intricate Linguistic Forms: Chinese characters intricately blend shape, sound, and meaning through composition and contextual pronunciation shifts, while flexible grammar and cultural elements (poetry, idioms, etc.) have evolved over millennia. (ii) Culture-Laden Facts: Untranslatable contexts in specific facts such as geographical and historical terms. (iii) Language-Specific Logic: Context-dependent reasoning patterns that rely on implicit connectors and topic prominence over subject-predicate structures, often leading to misalignment in logical chain extraction.

In this work, we propose to correct Chinese knowledge errors in LLMs via knowledge editing (Yao et al., 2023; Wang et al., 2023b; Zhang et al., 2024a; Hu et al., 2024; Ni et al., 2023; Wei et al., 2024b; Wang et al., 2024c; Padmanabhan et al., 2023; Qiao et al., 2024; Chen et al., 2024; Li et al., 2024; Hase et al., 2024; Wu et al., 2024a). Nevertheless, current research on knowledge editing predominantly concentrates on English-language factual knowledge (Cao et al., 2021; Meng et al., 2022; Wu et al., 2024b) derived from Wikipedia, which introduces an Anglo-centric bias. Recently, several multilingual datasets (Wang et al., 2023a; Xie et al., 2024; Wei et al., 2024a; Nie et al., 2024) have attempted to explore editing methods for different languages. However, these datasets are often created by translating an English corpus into another language, and translation (Vanmassenhove et al., 2019; Berman and Venuti, 2021) has been shown to fail to capture the intricate linguistic features and cultural nuances inherent to a specific language, resulting in a loss of lexical richness and diversity. Meanwhile, these works are primarily designed to assess the consistency of current editing methods across different languages and are not suitable for research on language-specific (i.e., Chinese) knowledge editing methods or for understanding LLMs' representation of specific languages.
Figure 1: Examples of data from each subcategory in CKnowEdit, with detailed explanations provided in §2.
To help address the three major challenges mentioned above and mitigate some existing deficiencies in current editing datasets, we construct a new Chinese dataset, CKnowEdit, which takes language-specific characteristics into account, ensuring that the data is not only linguistically accurate but also culturally matched. To ensure the quality and diversity of CKnowEdit, we collect data from a variety of sources, including classical literature, modern colloquialisms, and Baidu Tieba Ruozhiba (Bai et al., 2024), a popular Chinese online forum renowned for its abundance of logic puzzles and brainteasers and therefore highly suitable for evaluating reasoning capabilities. As a result, we organize CKnowEdit into 3 major categories, Linguistic, Facts, and Logic, corresponding to the three major challenges, and 10 subcategories, as shown in Figure 1.

To benchmark the effectiveness of knowledge editing methods on CKnowEdit, we evaluate five representative methods on four models. Departing from traditional knowledge editing evaluations that rely on token/logit-level measurements through teacher-forcing automation (Yao et al., 2023), we implement open-ended text generation to evaluate edited models under more realistic and demanding conditions and utilize an LLM-as-a-judge paradigm for effective evaluation. The results demonstrate the challenges presented by the dataset and underscore the need for more sophisticated Chinese knowledge editing approaches in the future. Our major contributions are as follows:

• We propose a new knowledge editing dataset, CKnowEdit, which is uniquely characterized by its Chinese linguistic features and cultural depth, comprehensively exploring the language's distinctiveness and the challenges it poses to LLMs from three perspectives.

• We report the empirical results of recent knowledge editing baselines on CKnowEdit, revealing their limitations when applied to Chinese literature.

• We further explore the challenges of Chinese knowledge editing and the struggles faced by existing models in understanding Chinese language and culture.

2 Criteria for Knowledge Sourcing

2.1 Chinese Linguistics

Chinese linguistics studies the phonetics, vocabulary, semantics, and grammar of the Chinese language. The linguistic knowledge in CKnowEdit is categorized into five subtypes, and each subtype presents unique challenges for LLMs.

Pinyin Pinyin notation serves as the official romanization system for Standard Mandarin Chinese, utilizing the Latin alphabet to represent Chinese characters phonetically. In Chinese, the phenomenon of polyphonic characters is widespread. As shown in Figure 1, the character '六' (six) is pronounced 'Liù' in most cases, but in '六安' (a city) it is pronounced 'Lù'. This inherent ambiguity in grapheme-phoneme mapping poses challenges for LLMs, especially when dealing with low-frequency characters with multiple pronunciations, which are also included in CKnowEdit.

Ancient Poetry Ancient poetry constitutes an essential component of Chinese classical literature and significantly differs from Modern Vernacular Chinese, particularly in semantic constructs and graphological conventions. Additionally, ancient poetry adheres to extremely strict requirements for format and rhythm, where every character must be precise and cannot be altered or omitted. This form of ancient language, commonly embedded in the parameters of large language models, poses a significant challenge to their memory and processing capabilities.

Classical Chinese Words in Classical Chinese often carry greatly different meanings compared to Modern Chinese, and the same character may represent distinct concepts depending on context. As shown in Figure 1, '安' (which means 'safety' in Modern Chinese) can denote 'to nurture' or 'to stabilize', or function as an interrogative term ('where/how') in classical texts. This semantic divergence poses unique challenges for language models trained on Modern Chinese data, particularly when processing context-sensitive interpretations of polysemous characters in classical literature.

Idiom Comprehending Chinese idioms directly or interpreting them literally often leads to a loss of their true meaning. In fact, the actual meaning of many idioms can be entirely opposite to their literal interpretation; for example, the literal reading of the idiom '七月流火', 'July's flowing fire', is contrary to its true meaning. LLMs' statistical learning paradigms struggle to resolve these interpretative gaps, particularly when processing idioms whose surface forms actively contradict their established semantic values in linguistic praxis.

Proverb Proverbs often use modern expressions with clear literal meanings, but their actual significance usually depends on metaphorical understanding. While these proverbs maintain consistent core meanings, LLMs struggle to apply them appropriately across different real-life situations.

2.2 Factual Knowledge

Historical and geographical knowledge in CKnowEdit covers key events and historical figures, regional landscapes, and unique local cultures across China. However, mainstream LLMs demonstrate notable gaps in their understanding of factual knowledge related to China's history and geography (Sun et al., 2024a).

2.3 Chinese Language-Specific Logic Traps

Phonetic Misunderstanding Figure 1 demonstrates a typical Chinese phonetic misunderstanding involving the polyphonic character '长'. When pronounced 'zhǎng', it combines with the preceding '队' to form '队长' (team leader), suggesting the illogical meaning 'The vaccinated team leader has died'. However, '长' here actually functions as an adjective meaning 'was long', pronounced 'cháng', and '队' simply means 'queue', indicating that 'Today's vaccination queue was extremely long'. This highlights how LLMs' pronunciation disambiguation failures can lead to semantic misinterpretations, even with proper word segmentation.

Reasoning Error When faced with complex reasoning tasks in Chinese, LLMs may commit reasoning errors; CKnowEdit therefore incorporates such data as well.

Wordplay This type of logical fallacy often arises from word segmentation errors or from ambiguous terms being misinterpreted with unintended meanings, thereby distorting the original semantic content of the textual components within a sentence. As illustrated in Example 1, the LLM misinterpreted '蓝牙耳机' (Bluetooth earphones) through erroneous word segmentation as ' - ' (literally
Figure 2: Overview of CKnowEdit construction. A full sample of CKnowEdit is shown in Figure 7 and 8.
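To make the data format concrete, the snippet below sketches what a single Pinyin-type record could look like, using the 六安 example from §2.1. The field names (prompt, target_new, portability, locality) are illustrative assumptions modeled on common knowledge-editing benchmarks rather than the official CKnowEdit schema; the authoritative format is the full sample shown in Figures 7 and 8.

```python
# A hypothetical CKnowEdit-style record for the Pinyin subtype (§2.1).
# Field names are illustrative assumptions, not the official schema;
# see Figures 7 and 8 for a real sample.
pinyin_record = {
    # Knowledge to be corrected: the polyphonic character 六 in the
    # city name 六安 is read 'Lù', not the usual 'Liù'.
    "prompt": "请给出“六安”中“六”的正确拼音。",          # "Give the correct pinyin of 六 in 六安."
    "target_new": "在地名“六安”中,“六”读作“Lù”。",       # "In the place name 六安, 六 is read 'Lù'."
    # Portability: a related query that should also benefit from the edit.
    "portability": {
        "prompt": "安徽省六安市的名称应该怎么读?",          # "How should the city name 六安 (Anhui) be read?"
        "answer": "Lù'ān",
    },
    # Locality: unrelated knowledge that must stay unchanged after editing.
    "locality": {
        "prompt": "“六”在“六个苹果”中读什么?",             # "How is 六 read in 六个苹果 (six apples)?"
        "answer": "Liù",
    },
}

if __name__ == "__main__":
    import json
    print(json.dumps(pinyin_record, ensure_ascii=False, indent=2))
```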
Metrics We employ 4 key knowledge editing evaluation metrics: (1) Edit Success (ES): this metric measures how well the edits align LLMs' responses with the expected outcomes. (2) Generalization (Gen): this metric assesses the weak generalization of the editing. (3) Portability (Por): this measures the model's capability to apply corrected knowledge to new but related prompts, assessing the robust generalization of the editing across contexts. (4) Locality (Loc): this metric ensures that edits do not inadvertently affect unrelated areas of the model's knowledge base.

4.2 Main Results

Methods Comparison AdaLoRA achieves the highest Edit Success in over 70% of cases across the 4 models, outperforming AlphaEdit and FT-M, which excel in 4 and 3 instances respectively but remain suboptimal overall. For the Generalization and Portability metrics, AdaLoRA dominates with nearly 70% and 86% of the top scores, respectively, while AlphaEdit consistently performs suboptimally. These results demonstrate that AdaLoRA achieves the best editing performance, contrasting with prior findings (Zhang et al., 2024a).

We believe the reason is that CKnowEdit's focus on editing long-text patterns and evaluating long-text generation differs fundamentally from prior studies. Traditional approaches like ROME edit models via localized parameter tweaks to precisely overwrite a single piece of factual knowledge expressed as a discrete triplet (s-r-o). While effective for closed-form tasks (e.g., token-level teacher-forcing evaluation), this approach disrupts the generative distribution needed for coherent open-ended text. In contrast, AdaLoRA adaptively adjusts multiple modules (such as attention heads and FFN layers), allowing the model to implicitly learn task-specific patterns (e.g., long-range dependencies). By holistically adjusting parameters linked to the target knowledge, AdaLoRA preserves contextual consistency, aligning edits with the broader language generation process.
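As a rough guide to reproducing this kind of comparison, the sketch below issues a single edit through the EasyEdit toolkit linked in footnote 1, using ROME as the example method (an AdaLoRA/LoRA hyperparameter class would follow the same pattern). The class names, config path, and arguments are assumptions based on the toolkit's typical documented usage, not the exact scripts behind Table 1, and should be checked against the repository.

```python
# Minimal sketch of one edit via EasyEdit (https://github.com/zjunlp/EasyEdit).
# Class names, the YAML path, and argument names are assumptions based on the
# toolkit's documented usage pattern, not the paper's exact scripts.
from easyeditor import BaseEditor, ROMEHyperParams

# Load method hyperparameters for a target model (placeholder config path).
hparams = ROMEHyperParams.from_hparams('./hparams/ROME/qwen2-7b.yaml')
editor = BaseEditor.from_hparams(hparams)

# One CKnowEdit-style Pinyin edit: in the city name 六安, 六 is read 'lù'.
metrics, edited_model, _ = editor.edit(
    prompts=['“六安”中“六”的正确读音是'],  # "The correct reading of 六 in 六安 is"
    ground_truth=['liù'],                    # the model's original (wrong) answer
    target_new=['lù'],                       # the corrected knowledge
    subject=['六安'],                        # ROME-style methods need an edit subject
)

print(metrics)  # toolkit-reported per-edit scores (success, generalization, ...)
```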
Figure 5: The format of the indicators in the figure is data type-metric; for example, Lin-ES (Linguistics-ES) represents the Edit Success rate on the Linguistics data category. The results of ROME are shown in Figure 14.
Data Types Comparison The editing performance on Ancient Poetry is notably poor across all knowledge types, especially for Portability, where almost all models and methods achieve scores below 1. As described in §2, Chinese ancient poetry poses significant challenges to the memorization capabilities of LLMs. This stems from two linguistic specificities: (1) Rare characters: many obscure characters in poetry appear infrequently in training data, leading to weak semantic representation and context modeling; (2) Distribution shift: the syntactic structures and vocabulary differ markedly from modern Chinese, making the patterns harder to capture. Combined, these factors cause strong prior biases from modern Chinese during next-token prediction. When generating text with modern-style prefixes, or when the current token is common in modern Chinese, models increasingly misalign the subsequent token distributions.

Additionally, the poor performance on Classical Chinese highlights the need for more advanced editing methods to handle its rich syntax, semantics, and context-dependency, particularly in addressing nuances like polysemy and homophony, which are less common in English.

4.3 Why do we need an editing dataset that is highly characteristic of Chinese?

The Irreplaceability of Chinese To better illustrate the unique characteristics of Chinese and its irreplaceability in conveying Chinese knowledge, we selected 100 data samples from each of the three knowledge categories in CKnowEdit. These samples were first translated into English, then edited using AdaLoRA and ROME on the four baseline models. The results were then translated back into Chinese and evaluated. The AdaLoRA results are shown in Figure 5.

It can be observed that in linguistic knowledge editing tasks, the results of English editing differ significantly from those of Chinese editing, often failing to produce precise edits. This is because the literal translation of Chinese linguistic knowledge into English frequently loses the original meaning, aesthetic value, correct structure, and language patterns, leading to significant deviations between the model's edited responses and the correct answers. For example, in the case of classical poetry editing shown in Figure 6(a), the model can successfully edit the English target. However, when translating back into Chinese, current translation software and LLMs have generally learned the language patterns of modern Chinese and are thus unable to translate an English sentence back into classical poetry.

In factual tasks, the results of English editing are generally on par with those of Chinese editing. This aligns with intuition, as factual knowledge is less dependent on the linguistic medium, and literal translations do not significantly alter the intended meaning.

In logical tasks, English editing performs even slightly better than Chinese editing. This is because many logic traps unique to the Chinese language, which are challenging for LLMs, are often lost during the translation process, reducing their logical complexity in the English version.
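The round-trip protocol described above (translate to English, edit in English, translate back, evaluate in Chinese) can be summarized as a short control-flow sketch. The `translate` and `edit_and_query` callables are hypothetical placeholders, since the paper does not specify the translation system or the exact editing scripts; only the pipeline structure is illustrated.

```python
# Sketch of the translate -> edit in English -> back-translate protocol (§4.3).
# `translate` and `edit_and_query` are hypothetical placeholders; the paper
# does not specify which translation system or editing scripts were used.
from typing import Callable, Dict, List


def round_trip_edit(
    samples: List[Dict[str, str]],               # each: {"prompt": zh, "target_new": zh}
    translate: Callable[[str, str, str], str],   # translate(text, src_lang, tgt_lang)
    edit_and_query: Callable[[str, str], str],   # edit in English, return English answer
) -> List[Dict[str, str]]:
    results = []
    for sample in samples:
        # 1) Chinese -> English
        prompt_en = translate(sample["prompt"], "zh", "en")
        target_en = translate(sample["target_new"], "zh", "en")
        # 2) Apply the edit and query the model entirely in English
        answer_en = edit_and_query(prompt_en, target_en)
        # 3) Translate the answer back into Chinese and keep it for evaluation
        #    against the original Chinese target (e.g., with LLM-as-a-judge).
        answer_zh = translate(answer_en, "en", "zh")
        results.append({
            "prompt": sample["prompt"],
            "reference": sample["target_new"],
            "prediction": answer_zh,
        })
    return results
```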
Figure 6: Part a) shows a case where the data is directly translated from Chinese to English and the model's responses are translated back into Chinese. Part b) includes two cases where, after editing the target knowledge in English, queries are asked directly in Chinese to test cross-language generalization.
Language Functional Area Offset Similar to the human brain, the neuron parameter regions for different languages in LLMs often do not overlap (Zhang et al., 2024b), creating natural barriers for cross-language knowledge editing and generalization. Previous studies (Wang et al., 2023a) show that when editing knowledge in English and testing its generalization in Chinese, performance sometimes drops – even for factual knowledge, where the English-Chinese gap is relatively small. As shown in Figure 6(b), our tests on Qwen2-7B-Instruct reveal this limitation: the model struggles to generalize English-edited knowledge to Chinese, whether for factual geography or linguistically complex tasks. For instance, while the model correctly answers the classical poetry question in English, it fails completely when the original question is posed in Chinese.

4.4 Human Evaluation

To verify the effectiveness of our designed automatic GPT-4 score for CKnowEdit evaluation, we randomly select 70 data samples from all knowledge types, along with outputs from the 4 baseline models, for human evaluation by our contracted annotators. From the human evaluation results, the overall correlation coefficient across all 4 metrics between the automatic and human evaluation is 0.70, indicating a high consistency between GPT-4 scores and human preferences.

5 Related Work

5.1 Knowledge Editing Methods

Current knowledge editing approaches can be categorized into two main types: preserving LMs' parameters or modifying LMs' parameters. Preservative methods incorporate external memory or additional trainable parameters: SERAC (Mitchell et al., 2022b) and IKE (Zheng et al., 2023a) leverage a counterfactual model and a multi-fact prompt, respectively, as external working memory, while CaliNET (Dong et al., 2022), T-Patcher (Huang et al., 2023b), GRACE (Hartvigsen et al., 2024), and WISE (Wang et al., 2024a) introduce extra trainable parameters. Locate-and-edit approaches first locate the relevant neurons and then modify those parameters; representative studies are KN (Dai et al., 2022), ROME (Meng et al., 2022), MEMIT (Meng et al., 2023), and NSE (Jiang et al., 2024). Additionally, meta-learning approaches utilize a hyper-network to generate the weights for layers in LLMs, including KE (Cao et al., 2021), MEND (Mitchell et al., 2022a), and MALMEN (Tan et al., 2023).

5.2 Knowledge Editing Datasets

Existing knowledge editing datasets have largely centered on English-language texts, such as ZsRE (Cao et al., 2021), Counterfact (Meng et al., 2022), KnowEdit (Zhang et al., 2024a), and MQuAKE (Zhong et al., 2023). Some research (Deng et al., 2024; Rosati et al., 2024; Wu et al., 2024b) has also introduced the concept of evaluating knowledge editing through unstructured text and long-form content, but these efforts have been predominantly limited to English. In a more inclusive direction, recent academic initiatives have broadened the scope of these datasets to include a multilingual dimension (Xie et al., 2024; Wei et al., 2024a; Wu et al., 2024a; Nie et al., 2024).

6 Conclusion

In this work, we created a new, high-quality Chinese knowledge editing dataset, CKnowEdit, which is rich in Chinese linguistic characteristics and linguistic value. This dataset comprehensively evaluates the performance of current mainstream editing methods on leading Chinese LLMs across three knowledge types: linguistics, facts, and logic. Furthermore, we adopted an evaluation approach that better aligns with real-world application requirements. To date, most existing methods and LLMs still cannot edit Chinese-characteristic knowledge well.
Table 1: Results (Edit Success / Generalization / Portability / Locality) of pre-edit and post-edit LLMs with 4 knowledge editing methods. We color red the metrics where editing results in negative gains. The underlined numbers correspond to the best method for every metric.
Figure 7: An example of the Classical Chinese type.
Figure 8: An example of the Classical Chinese type (translated into English).
Figure 9: Evaluation process of CKnowEdit.
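To complement Figure 9, the snippet below sketches how a single LLM-as-a-judge score might be collected for one metric. The judge prompt, the 0-10 scale, and the model name are illustrative assumptions for demonstration, not the exact CKnowEdit evaluation rubric behind the reported GPT-4 scores.

```python
# Illustrative LLM-as-a-judge scoring for one knowledge-editing metric.
# The rubric, the 0-10 scale, and the model name are assumptions for
# demonstration, not the exact CKnowEdit evaluation prompt.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_TEMPLATE = """You are grading a knowledge-edited language model.
Question: {question}
Reference answer: {reference}
Model answer: {prediction}
On a scale of 0 to 10, how well does the model answer convey the reference
knowledge? Reply with a single integer."""


def judge_score(question: str, reference: str, prediction: str,
                model: str = "gpt-4") -> int:
    prompt = JUDGE_TEMPLATE.format(question=question,
                                   reference=reference,
                                   prediction=prediction)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Assumes the judge follows the "single integer" instruction.
    return int(response.choices[0].message.content.strip())
```

Scores collected this way can then be averaged per metric and, as in §4.4, checked for agreement against human annotations.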