
CKnowEdit: A New Chinese Knowledge Editing Dataset for Linguistics, Facts, and Logic Error Correction in LLMs

Jizhan Fang1,*  Tianhe Lu1,*  Yunzhi Yao1,*  Ziyan Jiang1  Xin Xu2  Ningyu Zhang1,†  Huajun Chen1,†
1Zhejiang University  2University of California, San Diego
{fangjizhan, yyztodd, zhangningyu}@zju.edu.cn  [email protected]

Abstract

Chinese, as a linguistic system rich in depth and complexity, is characterized by distinctive elements such as ancient poetry, proverbs, idioms, and other cultural constructs. However, current Large Language Models (LLMs) face limitations in these specialized domains, highlighting the need for comprehensive datasets that can assess, continuously update, and progressively improve these culturally-grounded linguistic competencies through targeted training optimizations. To address this gap, we introduce CKnowEdit, the first-ever Chinese knowledge editing dataset designed to correct linguistic, factual, and logical errors in LLMs. We collect seven types of knowledge from a wide range of sources, including classical texts, idioms, and content from Baidu Tieba Ruozhiba, taking into account the unique polyphony, antithesis, and logical structures inherent in the Chinese language. By analyzing this dataset, we highlight the challenges current LLMs face in mastering Chinese. Furthermore, our evaluation of state-of-the-art knowledge editing techniques reveals opportunities to advance the correction of Chinese knowledge.¹

*Equal contribution and shared co-first authorship.
†Corresponding Author.
¹Code and dataset are available at https://github.com/zjunlp/EasyEdit.

1 Introduction

The reliance on static training data and the lack of explicit knowledge representation in Large Language Models often lead to issues such as hallucinations, bias, and offensive outputs (Zhao et al., 2023; Huang et al., 2023a; Liu et al., 2023; Sun et al., 2024b). These limitations become particularly pronounced when LLMs operate in complex domains or languages, such as Chinese. As shown in Figure 1, Chinese is a highly complex and linguistically unique system that presents three distinct challenges for LLMs compared with Indo-European languages (Luelsdorff, 1994; Matthiessen, 2023; Xu et al., 2023): (i) Linguistic Complexity: characters intricately blend shape, sound, and meaning through composition and contextual pronunciation shifts, while flexible grammar and cultural elements (poetry, idioms, etc.) have evolved over millennia. (ii) Culture-Laden Facts: specific facts, such as geographical and historical terms, carry untranslatable context. (iii) Language-Specific Logic: context-dependent reasoning patterns rely on implicit connectors and topic prominence rather than subject-predicate structures, often leading to misalignment in logical chain extraction.

In this work, we propose to correct Chinese knowledge errors in LLMs via knowledge editing (Yao et al., 2023; Wang et al., 2023b; Zhang et al., 2024a; Hu et al., 2024; Ni et al., 2023; Wei et al., 2024b; Wang et al., 2024c; Padmanabhan et al., 2023; Qiao et al., 2024; Chen et al., 2024; Li et al., 2024; Hase et al., 2024; Wu et al., 2024a). Nevertheless, current research on knowledge editing predominantly concentrates on English-language factual knowledge (Cao et al., 2021; Meng et al., 2022; Wu et al., 2024b) derived from Wikipedia, which introduces an Anglo-centric bias. Recently, several multilingual datasets (Wang et al., 2023a; Xie et al., 2024; Wei et al., 2024a; Nie et al., 2024) have attempted to explore editing methods for different languages. However, these datasets are often created by translating an English corpus into another language, and translation (Vanmassenhove et al., 2019; Berman and Venuti, 2021) has been shown to fail to capture the intricate linguistic features and cultural nuances inherent to a specific language, resulting in a loss of lexical richness and diversity. Meanwhile, these works are primarily designed to assess the coherence of current editing methods across different languages and are not suitable for research on language-specific (i.e., Chinese) knowledge editing methods or for understanding LLMs' representation of specific languages.
Figure 1: Examples of data from each subcategory in CKnowEdit, with detailed explanations provided in §2.

To help address the three major challenges mentioned above and mitigate some existing deficiencies in current editing datasets, we construct a new Chinese dataset, CKnowEdit, which takes language-specific characteristics into account, ensuring that the data is not only linguistically accurate but also culturally matched. To ensure the quality and diversity of CKnowEdit, we collect data from a variety of sources, including classical literature, modern colloquialisms, and Baidu Tieba Ruozhiba (Bai et al., 2024), a popular Chinese online forum renowned for its abundance of logic puzzles and brainteasers and thus highly suitable for evaluating reasoning capabilities. As a result, we organize CKnowEdit into 3 major categories—Linguistics, Facts, and Logic, corresponding to the three major challenges—and 10 subcategories, as shown in Figure 1.

To benchmark the effectiveness of knowledge editing methods on CKnowEdit, we evaluate five representative methods on four models. Departing from traditional knowledge editing evaluations that rely on token/logit-level measurements through teacher-forcing automation (Yao et al., 2023), we implement open-ended text generation to evaluate edited models under more realistic and demanding conditions, and we utilize an LLM-as-a-judge paradigm for effective evaluation. The results demonstrate the challenges presented by the dataset and underscore the need for more sophisticated Chinese knowledge editing approaches in the future. Our major contributions are as follows:

• We propose a new knowledge editing dataset, CKnowEdit, which is uniquely characterized by its Chinese linguistic features and cultural depth, comprehensively exploring the language's distinctiveness and the challenges it poses to LLMs from three perspectives.

• We report the empirical results of recent knowledge editing baselines on CKnowEdit, revealing their limitations when applied to Chinese literature.

• We further explore the challenges of Chinese knowledge editing and the struggles faced by existing models in understanding Chinese language and culture.
2 Criteria for Knowledge Sourcing

2.1 Chinese Linguistics

Chinese linguistics studies the phonetics, vocabulary, semantics, and grammar of the Chinese language; the linguistic knowledge in CKnowEdit is accordingly categorized into five subtypes, each of which presents unique challenges for LLMs.

Pinyin  Pinyin notation serves as the official romanization system for Standard Mandarin Chinese, utilizing the Latin alphabet to represent Chinese characters phonetically. In Chinese, polyphonic characters are widespread. As shown in Figure 1, the character '六' (six) is pronounced 'Liù' in most cases, but in '六安' (a city) it is pronounced 'Lù'. This inherent ambiguity in grapheme-phoneme mapping poses challenges for LLMs, especially when dealing with low-frequency characters with multiple pronunciations, which are also included in CKnowEdit.
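This ambiguity can be reproduced with off-the-shelf tooling. Below is a minimal sketch using the third-party pypinyin toolkit (not part of the CKnowEdit pipeline; shown purely for illustration):

```python
# Minimal sketch of Mandarin polyphony, using the third-party pypinyin
# toolkit (illustrative only; not part of the CKnowEdit pipeline).
from pypinyin import pinyin, Style

print(pinyin("六", style=Style.TONE))    # expected: [['liù']] -- the common reading
print(pinyin("六安", style=Style.TONE))  # expected: [['lù'], ['ān']] -- the city name
```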
Ancient Poetry  Ancient poetry constitutes an essential component of Chinese classical literature and differs significantly from Modern Vernacular Chinese, particularly in its semantic constructs and graphological conventions. Additionally, ancient poetry adheres to extremely strict requirements of format and rhythm, where every character must be precise and cannot be altered or omitted. This form of ancient language, commonly embedded in the parameters of large language models, poses a significant challenge to their memory and processing capabilities.

Classical Chinese  Words in Classical Chinese often carry greatly different meanings compared to Modern Chinese, and the same character may represent distinct concepts depending on context. As shown in Figure 1, the character '安' (which means 'safety' in Modern Chinese) can denote 'to nurture' or 'to stabilize', or function as an interrogative term ('where/how') in classical texts. This semantic divergence poses unique challenges for language models trained on Modern Chinese data, particularly when processing context-sensitive interpretations of polysemous characters in classical literature.

Idiom  Directly comprehending Chinese idioms or interpreting them literally often leads to a loss of their true meaning. In fact, the actual meaning of many idioms can be entirely opposite to their literal interpretation; for example, the idiom '七月流火', whose literal reading is 'July's flowing fire', is contrary to its true meaning (the weather beginning to cool). LLMs' statistical learning paradigms struggle to resolve these interpretative gaps, particularly when processing idioms whose surface forms actively contradict their established semantic values in linguistic praxis.

Proverb  Proverbs often use modern expressions with clear literal meanings, but their actual significance usually depends on metaphorical understanding. While these proverbs maintain consistent core meanings, LLMs struggle to apply them appropriately across different real-life situations.

2.2 Factual Knowledge

Historical and geographical knowledge in CKnowEdit covers key events and historical figures, regional landscapes, and unique local cultures across China. However, mainstream LLMs demonstrate notable gaps in their understanding of factual knowledge related to China's history and geography (Sun et al., 2024a).

2.3 Chinese Language-Specific Logic Traps

Phonetic Misunderstanding  Figure 1 demonstrates a typical Chinese phonetic misunderstanding involving the polyphonic character '长'. When pronounced 'zhǎng', it combines with the preceding '队' to form '队长' (team leader), suggesting the illogical meaning 'The vaccinated team leader has died'. However, '长' here actually functions as an adjective meaning 'was long', pronounced 'cháng', and '队' simply means 'queue', indicating that 'Today's vaccination queue was extremely long'. This highlights how LLMs' pronunciation disambiguation failures can lead to semantic misinterpretations, even with proper word segmentation.

Reasoning Error  When meeting complex reasoning tasks in the Chinese language, LLMs may commit reasoning errors; hence CKnowEdit incorporates such data into its considerations.

Wordplay  This type of logical fallacy often arises from word segmentation errors or from ambiguous terms being misinterpreted with unintended meanings, thereby distorting the original semantic content of the textual components within a sentence. As illustrated in Example 1, the LLM misinterpreted '蓝牙耳机' (Bluetooth earphones) through erroneous word segmentation, reading it literally as 'blue tooth–ear device' and forcing a literal interpretation within a physiological context (teeth and ears), ultimately producing semantically absurd outputs.
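For reference, a conventional segmenter resolves this example correctly; the sketch below uses the third-party jieba segmenter (illustrative only, not the paper's tooling):

```python
# Minimal sketch: segmenting the Wordplay example with the third-party
# jieba segmenter (illustrative only).
import jieba

print(jieba.lcut("蓝牙耳机"))  # expected: ['蓝牙', '耳机'] ('Bluetooth' + 'earphones')
# Reading the characters literally instead ('blue' + 'tooth', 'ear' + 'machine')
# yields the absurd physiological interpretation described above.
```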
Figure 2: Overview of CKnowEdit construction. A full sample of CKnowEdit is shown in Figures 7 and 8.

3 The Construction of CKnowEdit

3.1 Data Preprocessing

Data Collection  As described in §2, we classify the data types in CKnowEdit into 3 major categories and 10 subcategories, as also illustrated in Figure 1. Data collection is conducted based on this classification. We crawled authentic and diverse Chinese corpora and initially collected 11,981 raw data entries. All our data collectors adhere to the copyright and licensing terms of each data source website, and all the collected data are freely available for academic research. Detailed addresses of each source website can be found in Appendix A.

Data Filtering  As shown in Figure 2, we first convert all the collected raw data into queries and pose them to the Qwen-7B-Chat (Yang et al., 2024) model as a baseline. We then retain only those questions that the model answered incorrectly, discarding those it answered correctly. This filtering process ensures CKnowEdit remains challenging and justifies the necessity of applying knowledge editing techniques. To maintain data quality, we conduct a manual review of all collected responses.
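A minimal sketch of this filtering step is shown below; `ask_qwen` and `is_correct` are hypothetical helpers standing in for the actual Qwen-7B-Chat inference and the correctness check, not real CKnowEdit code:

```python
# Sketch of the CKnowEdit data filtering step: keep only the queries that
# the baseline model answers incorrectly. `ask_qwen` and `is_correct` are
# hypothetical helpers, not actual CKnowEdit code.
def filter_queries(raw_entries, ask_qwen, is_correct):
    kept = []
    for entry in raw_entries:                      # 11,981 raw entries initially
        answer = ask_qwen(entry["query"])          # baseline Qwen-7B-Chat response
        if not is_correct(answer, entry["reference"]):
            kept.append(entry)                     # model failed -> worth editing
    return kept                                    # later distilled to 1,854 samples
```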

Figure 3: The statistics of CKnowEdit.

3.2 Data Annotation

Prompt-Target Construction  The queries remaining after filtering are used as the prompt field of the data. For the target field, fixed or data-provided answers are used directly, while open-ended explanations (e.g., interpretations of logic errors) are generated by GPT-4. Moreover, all answers generated by GPT-4 undergo meticulous manual review and correction to ensure their accuracy.

In-Scope Construction  Effective model editing requires consistent behavioral adjustments across all examples within the editing scope. Beyond correcting the primary target knowledge, related in-scope information conveying similar concepts should also be updated. We therefore assess two distinct generalization capabilities: weak and robust generalization. Specifically, we evaluate the weak generalization effect by rephrasing the prompt, such as rephrasing 'Please complete the following ancient poem' as 'The next line of the following ancient poem is...'. Robust generalization is measured through two approaches: (i) Context Transfer: this involves transferring the same knowledge or language pattern to a different application scenario to see whether the edited model has truly learned the knowledge. For example, in classical Chinese, the character '安' means 'nurture' in the phrase '衣食所安' (provides sustenance); we then ask the edited model about the meaning of '安' in '老者安之' (the elderly are supported), where it still means 'nurture'. (ii) Logical Single-Hop: we present the edited model with a question that requires one additional reasoning step beyond the original prompt. For example, if the original prompt is 'Please complete the following ancient poem A' (with the correct answer being B), the portability field would then be 'What is the line before B?'
Out-Scope Construction  A successful edit should adjust the targeted knowledge locally while leaving unrelated knowledge unaffected. However, current approaches that modify internal model parameters often introduce knowledge conflicts and distortions (Li et al., 2023). Unlike other knowledge editing datasets that adopt entirely unrelated knowledge for locality evaluation, we construct our locality field by selecting knowledge that is somewhat related to the target knowledge (e.g., sharing the same subject) but contains distinct factual information. This approach provides a stricter evaluation of editing side-effects while posing greater challenges for language models.
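Taken together, §3.2 implies a per-sample layout along the following lines (a schematic sketch; the field names paraphrase the paper's descriptions, and real samples appear in Figures 7 and 8):

```python
# Schematic CKnowEdit-style record (field names paraphrase the paper's
# descriptions; values are placeholders, not actual dataset content).
sample = {
    "prompt": "...",       # filtered query the baseline model got wrong
    "target_new": "...",   # fixed answer, or GPT-4-drafted then human-corrected
    "rephrase": "...",     # weak generalization: a paraphrase of the prompt
    "portability": "...",  # robust generalization: context transfer or logical single-hop
    "locality": "...",     # related-but-distinct knowledge that must stay unchanged
}
```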
3.3 Dataset Statistics

Finally, we distilled 1,854 samples from the 11,981 raw entries to form CKnowEdit. Among the three main knowledge classifications in CKnowEdit, the largest proportion is attributed to linguistic data (48.40%), followed by logic reasoning data (45.63%), because we found that knowledge highly characteristic of the Chinese language poses significant challenges for current LLMs. The specific quantity and proportion of each data category are shown in Figure 3.

3.4 Quality Assurance

After constructing CKnowEdit, we implement a comprehensive quality assurance process to ensure data reliability. We hire professional NLP annotators to review all the fields within the dataset. The quality assurance process involves five steps: (1) Task Setup: the dataset is split into 3 fields—prompt-target, generalization, and locality—each assigned to a separate team. (2) Team Training: team members are trained to understand their assigned field's purpose and to follow standardized review workflows. (3) Guideline Calibration: we conduct a trial review on a random 20% of the data to fine-tune the review process. (4) Dual Review: each field is independently reviewed by two annotators; a field is considered acceptable only if both annotators' conclusions match our own and they identify no issues with either the question or the ground truth. (5) Resolution of Discrepancies: for any fields that fail at step 4, the authors discuss whether to retain, discard, or correct them, depending on the nature of the identified issues.

4 Experiments

4.1 Experiment Settings

Models and Editing Methods  To better evaluate editing effectiveness on CKnowEdit, we select 4 advanced LLMs that are widely used in the Chinese community: Qwen-7B-Chat, Qwen2-7B-Instruct (Yang et al., 2024), DeepSeek-LLM-7B-Chat (DeepSeek-AI, 2024), and Baichuan2-7B-Chat (Baichuan, 2023). Among them, Qwen-7B-Chat is the original model used for data collection, providing baseline performance. We investigate diverse model editing methods, including FT-M (Zhang et al., 2024a), AdaLoRA (Zhang et al., 2023), ROME, GRACE, and AlphaEdit (Fang et al., 2024). All experiments are conducted with EasyEdit (Wang et al., 2024b). All models are deployed and edited on 1 to 2 NVIDIA A800 GPUs.
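As a sketch of how one such edit is issued through EasyEdit (following the library's public BaseEditor interface; the config path and strings here are illustrative, and some methods, e.g., ROME, additionally require a subject argument):

```python
# Sketch of a single knowledge edit via EasyEdit (config path and strings
# are illustrative; see https://github.com/zjunlp/EasyEdit for real usage).
from easyeditor import BaseEditor, ROMEHyperParams

hparams = ROMEHyperParams.from_hparams("hparams/ROME/qwen-7b.yaml")  # hypothetical path
editor = BaseEditor.from_hparams(hparams)

metrics, edited_model, _ = editor.edit(
    prompts=["..."],      # the CKnowEdit prompt field
    target_new=["..."],   # the corrected answer to write into the model
)
```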
Evaluation  Unlike conventional knowledge editing evaluations that use token/logit-level metrics with teacher-forcing automation (Yao et al., 2023), we adopt open-ended text generation to assess edited models in more practical and challenging scenarios. While some studies (Deng et al., 2024) under similar setups use metrics like ROUGE-L or semantic similarity, we find these metrics often fail to reflect true text quality. For instance, ROUGE-L is heavily skewed by text length: shorter reference texts paired with longer model outputs lead to unreliable scores.

Inspired by MT-Bench (Zheng et al., 2023b), which reveals that strong LLM judges like GPT-4o can align closely with human preferences, we customize prompts and evaluation processes for each knowledge category's unique characteristics, enabling GPT-4o to serve as the evaluator. For each evaluation metric, we provide GPT-4o (gpt-4o-2024-08-06) with the corresponding question, the edited model's response, and the reference answer. GPT-4o then assigns a score from 1 to 10 based on the relevance between the model's response and the reference answer. For detailed evaluation procedures and templates, refer to Figures 9 to 13.
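A minimal sketch of this judging call is given below (the judge prompt is paraphrased; the actual templates are shown in Figures 12 and 13):

```python
# Sketch of the GPT-4o judging step (gpt-4o-2024-08-06). The judge prompt
# here is paraphrased; the paper's real templates are in Figures 12-13.
from openai import OpenAI

client = OpenAI()

def judge(question: str, response: str, reference: str) -> str:
    prompt = (
        "Rate from 1 to 10 how well the response matches the reference answer.\n"
        f"Question: {question}\nResponse: {response}\nReference: {reference}\n"
        "Reply with the score only."
    )
    result = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content
```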
Figure 4: Main results. We do not report the locality of Ancient Poetry, Proverbs, Idioms and Facts Knowledge
because it is challenging to find out-scope knowledge that is both relevant to and distinct from the target knowledge
when we construct the locality field.

Metrics  We employ 4 key knowledge editing evaluation metrics: (1) Edit Success (ES): measures how well the edits align the LLM's responses with the expected outcomes. (2) Generalization (Gen): assesses the weak generalization of the editing. (3) Portability (Por): measures the model's capability to apply corrected knowledge to new but related prompts, assessing the robust generalization of the editing across contexts. (4) Locality (Loc): ensures that edits do not inadvertently affect unrelated areas of the model's knowledge base.
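In terms of the judged scores, each metric is simply an aggregate of the 1–10 judgments over its corresponding dataset field (a schematic sketch; `judge_score` is a hypothetical wrapper around the GPT-4o call above):

```python
# Schematic mapping from CKnowEdit fields to the four metrics.
# `judge_score(sample, field)` is a hypothetical 1-10 scorer built on the
# GPT-4o judging sketch in Section 4.1.
def compute_metrics(samples, judge_score):
    metrics = {}
    for name, field in [("ES", "prompt"), ("Gen", "rephrase"),
                        ("Por", "portability"), ("Loc", "locality")]:
        scores = [judge_score(s, field) for s in samples]
        metrics[name] = sum(scores) / len(scores)
    return metrics
```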
4.2 Main Results

Methods Comparison  AdaLoRA achieves the highest Edit Success in over 70% of cases across the 4 models, outperforming AlphaEdit and FT-M, which excel in 4 and 3 instances respectively but remain suboptimal overall. For the Generalization and Portability metrics, AdaLoRA dominates with nearly 70% and 86% of the top scores, respectively, while AlphaEdit consistently performs suboptimally. These results demonstrate that AdaLoRA achieves the best editing performance, contrasting with prior findings (Zhang et al., 2024a).

We believe the reason is that CKnowEdit's focus on editing long-text patterns and evaluating long-text generation differs fundamentally from prior studies. Traditional approaches like ROME edit models via localized parameter tweaks to precisely overwrite a single piece of factual knowledge as a discrete triplet (s-r-o). While effective for closed-form tasks (e.g., token-level teacher-forcing evaluation), this approach disrupts the generative distribution needed for coherent open-ended text. In contrast, AdaLoRA adaptively adjusts multiple modules (such as attention heads and FFN layers), allowing the model to implicitly learn task-specific patterns (e.g., long-range dependencies). By holistically adjusting parameters linked to the target knowledge, AdaLoRA preserves contextual consistency, aligning edits with the broader language generation process.
Figure 5: The format of the indicators in the figure is data type–metric; for example, Lin-ES (Linguistics-ES) denotes the Edit Success on the linguistic data category. The results of ROME are shown in Figure 14.

Data Types Comparison  The editing performance on Ancient Poetry is notably poor across all knowledge types, especially for Portability, where almost all models and methods achieve scores below 1. As described in §2, Chinese ancient poetry poses significant challenges to the memorization capabilities of LLMs. This stems from two linguistic specificities: (1) Rare characters: many obscure characters in poetry appear infrequently in training data, leading to weak semantic representation and context modeling; (2) Distribution shift: the syntactic structures and vocabulary differ markedly from modern Chinese, making the patterns harder to capture. Combined, these factors cause strong prior biases toward modern Chinese during next-token prediction: when generating text with modern-style prefixes, or when the current token is common in modern Chinese, models increasingly misalign subsequent token distributions.

Additionally, the poor performance on Classical Chinese highlights the need for more advanced editing methods to handle its rich syntax, semantics, and context-dependency, particularly in addressing nuances like polysemy and homophony, which are less common in English.

4.3 Why do we need an editing dataset that is highly characteristic of Chinese?

The Irreplaceability of Chinese  To better illustrate the unique characteristics of Chinese and its irreplaceability in conveying Chinese knowledge, we selected 100 data samples from each of the three knowledge categories in CKnowEdit. These samples were first translated into English, then edited using AdaLoRA and ROME on the four baseline models. The results were then translated back into Chinese and evaluated. The AdaLoRA results are shown in Figure 5.

It can be observed that in linguistic knowledge editing tasks, the results of English editing differ significantly from those of Chinese editing, often failing to produce precise edits. This is because the literal translation of Chinese linguistic knowledge into English frequently loses the original meaning, aesthetic value, correct structure, and language patterns, leading to significant deviations between the model's edited responses and the correct answers. For example, in the classical poetry editing case shown in Figure 6a), the model can successfully edit the English target. However, when translating back into Chinese, current translation software and LLMs have generally learned the language patterns of modern Chinese and are thus unable to translate an English sentence back into classical poetry.

In factual tasks, the results of English editing are generally on par with those of Chinese editing. This aligns with intuition, as factual knowledge is less dependent on the linguistic medium, and literal translations do not significantly alter the intended meaning.

In logical tasks, English editing performs even slightly better than Chinese editing. This is because many logic traps unique to the Chinese language, which are challenging for LLMs, are often lost during the translation process, reducing their logical complexity in the English version.

Language Functional Area Offset  Similar to the human brain, the neuron parameter regions for different languages in LLMs often do not overlap (Zhang et al., 2024b), creating natural barriers for cross-language knowledge editing and generalization. Previous studies (Wang et al., 2023a) show that when editing knowledge in English and testing its generalization in Chinese, performance sometimes drops, even for factual knowledge, where the English-Chinese gap is relatively small. As shown in Figure 6(b), our tests on Qwen2-7B-Instruct reveal this limitation: the model struggles to generalize English-edited knowledge to Chinese, whether for factual geography or linguistically complex tasks. For instance, while the model correctly answers a classical poetry question in English, it fails completely when the original Chinese question is posed.
Figure 6: Part a) shows a case where data is directly translated from Chinese to English and the model's responses are translated back into Chinese. Part b) includes two cases where, after editing the target knowledge in English, queries are asked directly in Chinese to test cross-language generalization.

4.4 Human Evaluation

To verify the effectiveness of our designed automatic GPT-4 score for CKnowEdit evaluation, we randomly select 70 data samples across all knowledge types, along with the outputs of the 4 baseline models, for human evaluation by our contracted annotators. From the human evaluation results, the overall correlation coefficient between the automatic and human evaluations across all 4 metrics is 0.70, indicating a high consistency between GPT-4 scores and human preferences.

5 Related Work

5.1 Knowledge Editing Methods

Current knowledge editing approaches can be categorized into two main types: preserving LMs' parameters or modifying LMs' parameters. Preservative methods incorporate external memory or additional trainable parameters: SERAC (Mitchell et al., 2022b) and IKE (Zheng et al., 2023a) leverage a counterfactual model and a multi-fact prompt, respectively, as external working memory, while CaliNET (Dong et al., 2022), T-Patcher (Huang et al., 2023b), GRACE (Hartvigsen et al., 2024), and WISE (Wang et al., 2024a) introduce extra trainable parameters. The locate-and-edit approaches first locate the relevant neurons and then modify those parameters; representative studies are KN (Dai et al., 2022), ROME (Meng et al., 2022), MEMIT (Meng et al., 2023), and NSE (Jiang et al., 2024). Additionally, meta-learning approaches utilize a hyper-network to generate the weights of layers in LLMs, including KE (Cao et al., 2021), MEND (Mitchell et al., 2022a), and MALMEN (Tan et al., 2023).

5.2 Knowledge Editing Datasets

Existing knowledge editing datasets have largely centered on English-language texts, such as ZsRE (Cao et al., 2021), Counterfact (Meng et al., 2022), KnowEdit (Zhang et al., 2024a), and MQuAKE (Zhong et al., 2023). Some research (Deng et al., 2024; Rosati et al., 2024; Wu et al., 2024b) has also introduced the concept of evaluating knowledge editing through unstructured text and long-form content, but these efforts have been predominantly limited to English. In a more inclusive direction, recent academic initiatives have broadened the scope of these datasets to include a multilingual dimension (Xie et al., 2024; Wei et al., 2024a; Wu et al., 2024a; Nie et al., 2024).

6 Conclusion

In this work, we created a new, high-quality Chinese knowledge editing dataset, CKnowEdit, which is rich in Chinese linguistic characteristics and linguistic value. This dataset comprehensively evaluates the performance of current mainstream editing methods on leading Chinese LLMs across three knowledge types: linguistics, facts, and logic. Furthermore, we adopted an evaluation approach that better aligns with real-world application requirements. To date, most existing methods and LLMs still cannot edit Chinese-characteristic knowledge well.

Limitations

Since the original intention of this work is to study knowledge with distinctive Chinese linguistic characteristics, and Chinese linguistic knowledge and Chinese logical knowledge better reflect these features, the quantity of these two types of knowledge in CKnowEdit is significantly greater than that of factual knowledge. This is a limitation of CKnowEdit. In addition, using GPT-4 to evaluate the output of other LLMs is already a widely used method. Although using GPT-4 to evaluate GPT-4 may be biased, using GPT-4 to evaluate other models still has reference value. Moreover, not only for the tasks we propose but also for other tasks, the community is still actively exploring how to effectively evaluate LLMs. As a temporary compromise, we remind readers to interpret the GPT-4 scores carefully.
References

Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang Jin, Ziqiang Liu, Junting Zhou, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Wang, Ruibin Yuan, Haihong Wu, Hongquan Lin, Wenhao Huang, Jiajun Zhang, Wenhu Chen, Chenghua Lin, Jie Fu, Min Yang, Shiwen Ni, and Ge Zhang. 2024. Coig-cqia: Quality is all you need for chinese instruction fine-tuning. Preprint, arXiv:2403.18058.

Baichuan. 2023. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305.

Antoine Berman and Lawrence Venuti. 2021. Translation and the trials of the foreign. In The Translation Studies Reader, pages 247–260. Routledge.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, pages 6491–6506. Association for Computational Linguistics.

Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, Xiongxiao Xu, Jia-Chen Gu, Jindong Gu, Huaxiu Yao, Chaowei Xiao, Xifeng Yan, William Yang Wang, Philip Torr, Dawn Song, and Kai Shu. 2024. Can editing llms inject harm? CoRR, abs/2407.20224.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, pages 8493–8502. Association for Computational Linguistics.

DeepSeek-AI. 2024. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954.

Jingcheng Deng, Zihao Wei, Liang Pang, Hanxing Ding, Huawei Shen, and Xueqi Cheng. 2024. Unke: Unstructured knowledge editing in large language models. arXiv preprint arXiv:2405.15349.

Qingxiu Dong, Damai Dai, Yifan Song, Jingjing Xu, Zhifang Sui, and Lei Li. 2022. Calibrating factual knowledge in pretrained language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5937–5947. Association for Computational Linguistics.

Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Xiang Wang, Xiangnan He, and Tat-Seng Chua. 2024. AlphaEdit: Null-space constrained knowledge editing for language models. arXiv preprint arXiv:2410.02355.

Tom Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. 2024. Aging with grace: Lifelong model editing with discrete key-value adaptors. Advances in Neural Information Processing Systems, 36.

Peter Hase, Thomas Hofweber, Xiang Zhou, Elias Stengel-Eskin, and Mohit Bansal. 2024. Fundamental problems with model editing: How should rational belief revision work in llms? arXiv preprint arXiv:2406.19354.

Xinshuo Hu, Dongfang Li, Baotian Hu, Zihao Zheng, Zhenyu Liu, and Min Zhang. 2024. Separate the wheat from the chaff: Model deficiency unlearning via parameter-efficient module operation. In Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, pages 18252–18260. AAAI Press.

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. 2023a. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. CoRR, abs/2311.05232.

Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, and Zhang Xiong. 2023b. Transformer-patcher: One mistake worth one neuron. In The Eleventh International Conference on Learning Representations, ICLR 2023. OpenReview.net.

Houcheng Jiang, Junfeng Fang, Tianyu Zhang, An Zhang, Ruipeng Wang, Tao Liang, and Xiang Wang. 2024. Neuron-level sequential editing for large language models. Preprint, arXiv:2410.04045.

Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, and Yang Liu. 2024. Badedit: Backdooring large language models by model editing. In The Twelfth International Conference on Learning Representations, ICLR 2024. OpenReview.net.

Zhoubo Li, Ningyu Zhang, Yunzhi Yao, Mengru Wang, Xi Chen, and Huajun Chen. 2023. Unveiling the pitfalls of knowledge editing for large language models. arXiv preprint arXiv:2310.02129.

Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. 2023. Trustworthy llms: A survey and guideline for evaluating large language models' alignment. arXiv preprint arXiv:2308.05374.

Philip A. Luelsdorff. 1994. The Prague School of Structural and Functional Linguistics, volume 41. John Benjamins Publishing.

Christian M.I.M. Matthiessen. 2023. System in Systemic Functional Linguistics: A System-Based Theory of Language. University of Toronto Press.

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372.

Kevin Meng, Arnab Sen Sharma, Alex J. Andonian, Yonatan Belinkov, and David Bau. 2023. Mass-editing memory in a transformer. In The Eleventh International Conference on Learning Representations, ICLR 2023. OpenReview.net.

Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. 2022a. Fast model editing at scale. In The Tenth International Conference on Learning Representations, ICLR 2022. OpenReview.net.

Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D. Manning, and Chelsea Finn. 2022b. Memory-based model editing at scale. In International Conference on Machine Learning, pages 15817–15831. PMLR.

Shiwen Ni, Dingwei Chen, Chengming Li, Xiping Hu, Ruifeng Xu, and Min Yang. 2023. Forgetting before learning: Utilizing parametric arithmetic for knowledge updating in large language models. CoRR, abs/2311.08011.

Ercong Nie, Bo Shao, Zifeng Ding, Mingyang Wang, Helmut Schmid, and Hinrich Schütze. 2024. Bmike-53: Investigating cross-lingual knowledge editing with in-context learning. arXiv preprint arXiv:2406.17764.

Shankar Padmanabhan, Yasumasa Onoe, Michael J. Q. Zhang, Greg Durrett, and Eunsol Choi. 2023. Propagating knowledge updates to lms through distillation. In Advances in Neural Information Processing Systems 36, NeurIPS 2023.

Shanbao Qiao, Xuebing Liu, and Seung-Hoon Na. 2024. Distillmike: Editing distillation of massive in-context knowledge editing in large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 7639–7654. Association for Computational Linguistics.

Domenic Rosati, Robie Gonzales, Jinkun Chen, Xuemin Yu, Melis Erkan, Yahya Kayani, Satya Deepika Chavatapalli, Frank Rudzicz, and Hassan Sajjad. 2024. Long-form evaluation of model editing. arXiv preprint arXiv:2402.09394.

Jiaxing Sun, Weiquan Huang, Jiang Wu, Chenya Gu, Wei Li, Songyang Zhang, Hang Yan, and Conghui He. 2024a. Benchmarking chinese commonsense reasoning of llms: From chinese-specifics to reasoning-memorization correlations. Preprint, arXiv:2403.14112.

Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. 2024b. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561.

Chenmien Tan, Ge Zhang, and Jie Fu. 2023. Massive editing for large language models via meta learning. CoRR, abs/2311.04661.

Eva Vanmassenhove, Dimitar Shterionov, and Andy Way. 2019. Lost in translation: Loss and decay of linguistic richness in machine translation. In Proceedings of Machine Translation Summit XVII: Research Track, pages 222–232.

Jiaan Wang, Yunlong Liang, Zengkui Sun, Yuxuan Cao, and Jiarong Xu. 2023a. Cross-lingual knowledge editing in large language models. arXiv preprint arXiv:2309.08952.

Peng Wang, Zexi Li, Ningyu Zhang, Ziwen Xu, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. 2024a. Wise: Rethinking the knowledge memory for lifelong model editing of large language models. Preprint, arXiv:2405.14768.

Peng Wang, Ningyu Zhang, Bozhong Tian, Zekun Xi, Yunzhi Yao, Ziwen Xu, Mengru Wang, Shengyu Mao, Xiaohan Wang, Siyuan Cheng, Kangwei Liu, Yuansheng Ni, Guozhou Zheng, and Huajun Chen. 2024b. EasyEdit: An easy-to-use knowledge editing framework for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 82–93. Association for Computational Linguistics.

Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, et al. 2023b. Knowledge editing for large language models: A survey. arXiv preprint arXiv:2310.16218.

Yiwei Wang, Muhao Chen, Nanyun Peng, and Kai-Wei Chang. 2024c. Deepedit: Knowledge editing as decoding with constraints. CoRR, abs/2401.10471.

Zihao Wei, Jingcheng Deng, Liang Pang, Hanxing Ding, Huawei Shen, and Xueqi Cheng. 2024a. Mlake: Multilingual knowledge editing benchmark for large language models. arXiv preprint arXiv:2404.04990.

Zihao Wei, Liang Pang, Hanxing Ding, Jingcheng Deng, Huawei Shen, and Xueqi Cheng. 2024b. Stable knowledge editing in large language models. CoRR, abs/2402.13048.

Xiaobao Wu, Liangming Pan, William Yang Wang, and Anh Tuan Luu. 2024a. AKEW: Assessing knowledge editing in the wild. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, pages 15118–15133. Association for Computational Linguistics.

Xiaobao Wu, Liangming Pan, William Yang Wang, and Anh Tuan Luu. 2024b. Updating language models with unstructured facts: Towards practical knowledge editing. arXiv preprint arXiv:2402.18909.

Jiakuan Xie, Pengfei Cao, Yuheng Chen, Yubo Chen, Kang Liu, and Jun Zhao. 2024. Memla: Enhancing multilingual knowledge editing with neuron-masked low-rank adaptation. arXiv preprint arXiv:2406.11566.

Liang Xu, Anqi Li, Lei Zhu, Hang Xue, Changtai Zhu, Kangkang Zhao, Haonan He, Xuanwei Zhang, Qiyue Kang, and Zhenzhong Lan. 2023. Superclue: A comprehensive chinese large language model benchmark. arXiv preprint arXiv:2307.15020.

An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671.

Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. 2023. Editing large language models: Problems, methods, and opportunities. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, pages 10222–10240. Association for Computational Linguistics.

Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, and Huajun Chen. 2024a. A comprehensive study of knowledge editing for large language models. CoRR, abs/2401.01286.

Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. 2023. Adaptive budget allocation for parameter-efficient fine-tuning. In The Eleventh International Conference on Learning Representations.

Zhihao Zhang, Jun Zhao, Qi Zhang, Tao Gui, and Xuanjing Huang. 2024b. Unveiling linguistic regions in large language models. Preprint, arXiv:2402.14700.

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023. A survey of large language models. CoRR, abs/2303.18223.

Ce Zheng, Lei Li, Qingxiu Dong, Yuxuan Fan, Zhiyong Wu, Jingjing Xu, and Baobao Chang. 2023a. Can we edit factual knowledge by in-context learning? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, pages 4862–4876. Association for Computational Linguistics.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023b. Judging llm-as-a-judge with mt-bench and chatbot arena. Preprint, arXiv:2306.05685.

Zexuan Zhong, Zhengxuan Wu, Christopher D. Manning, Christopher Potts, and Danqi Chen. 2023. Mquake: Assessing knowledge editing in language models via multi-hop questions. arXiv preprint arXiv:2305.14795.
A Data Source Websites

This section provides a detailed overview of the data types and their corresponding data source websites:

(1) Ancient Poetry:
https://zhuanlan.zhihu.com/p/414484867

(2) Proverbs:
http://www.360doc.com/content/19/0218/14/39098269_815762159.shtml
http://www.360doc.com/content/19/0312/16/5784427_820995624.shtml
http://www.360doc.com/content/19/0126/14/55773589_811408910.shtml

(3) Idioms:
https://zhuanlan.zhihu.com/p/599709230

(4) Pinyin notation:
https://zhuanlan.zhihu.com/p/599709230

(5) Classical Chinese:
https://zhuanlan.zhihu.com/p/622859964
https://www.bilibili.com/read/cv20279857/
https://wyw.hwxnet.com/search.do?wd=%E9%84%99&x=0&y=0

(6) Geography and History:
https://baijiahao.baidu.com/s?id=1682950669904608106&wfr=spider&for=pc
https://www.sohu.com/a/419822319_100941
https://www.jingyanben.com/qitawendang/125282.html?page=1
http://www.360doc.com/content/20/0613/11/7254176_918223750.shtml

(7) Logic Error:
https://github.com/Leymore/ruozhiba
https://docs.qq.com/sheet/DUlZ6aURhamdwb1RO?tab=BB08J2
| Model | Knowledge Type | Pre-edit | FT-M | AdaLoRA | ROME | GRACE | AlphaEdit |
|---|---|---|---|---|---|---|---|
| Qwen-7B-Chat | Pinyin | 1.22 / 0.76 / 0.68 / 8.53 | 6.33 / 6.08 / 6.27 / 8.55 | 9.52 / 8.90 / 7.51 / 5.66 | 7.22 / 7.16 / 6.20 / 7.37 | 6.18 / 5.64 / 5.73 / 8.06 | 6.75 / 6.82 / 5.88 / 6.30 |
| Qwen-7B-Chat | Classical Chinese | 2.93 / 3.52 / 3.53 / 5.96 | 3.71 / 3.79 / 4.23 / 6.26 | 6.72 / 6.33 / 5.64 / 3.99 | 2.88 / 3.72 / 3.26 / 4.61 | 2.81 / 3.77 / 3.28 / 6.05 | 6.31 / 6.82 / 4.11 / 5.94 |
| Qwen-7B-Chat | Idiom | 6.77 / 6.91 / 6.55 / - | 6.70 / 6.66 / 6.79 / - | 8.77 / 8.34 / 8.05 / - | 8.48 / 8.18 / 7.19 / - | 6.60 / 6.68 / 6.86 / - | 9.12 / 8.79 / 7.97 / - |
| Qwen-7B-Chat | Proverb | 5.38 / 5.10 / 6.22 / - | 5.31 / 5.51 / 6.36 / - | 8.13 / 7.79 / 7.58 / - | 7.85 / 7.73 / 7.02 / - | 5.40 / 5.39 / 6.42 / - | 8.38 / 8.36 / 7.50 / - |
| Qwen-7B-Chat | Ancient Poetry | 2.10 / 1.63 / 0.54 / - | 1.85 / 1.19 / 0.70 / - | 7.35 / 6.13 / 0.26 / - | 3.62 / 2.33 / 0.54 / - | 1.49 / 1.29 / 0.49 / - | 3.41 / 1.66 / 0.18 / - |
| Qwen-7B-Chat | Fact | 2.88 / 3.20 / 3.91 / - | 3.03 / 2.51 / 4.03 / - | 7.33 / 6.50 / 5.61 / - | 4.34 / 3.20 / 3.88 / - | 3.17 / 2.94 / 3.81 / - | 3.03 / 3.74 / 3.29 / - |
| Qwen-7B-Chat | Logic | 4.59 / 4.81 / 5.30 / 7.09 | 5.63 / 5.78 / 6.29 / 6.94 | 8.22 / 7.28 / 6.93 / 7.19 | 5.43 / 4.95 / 5.77 / 6.32 | 5.56 / 5.67 / 6.21 / 6.96 | 5.83 / 5.13 / 6.25 / 6.97 |
| Qwen2-7B-Instruct | Pinyin | 1.75 / 1.19 / 1.02 / 8.04 | 6.24 / 6.58 / 3.09 / 0.83 | 8.80 / 8.59 / 7.57 / 4.22 | 6.80 / 7.29 / 6.10 / 6.91 | 7.03 / 6.14 / 6.20 / 8.05 | 6.22 / 6.59 / 6.25 / 7.16 |
| Qwen2-7B-Instruct | Classical Chinese | 4.87 / 5.42 / 5.25 / 6.92 | 7.42 / 7.57 / 6.51 / 0.91 | 6.61 / 7.55 / 6.24 / 3.13 | 7.77 / 7.13 / 5.66 / 6.01 | 4.58 / 5.56 / 5.29 / 7.06 | 8.51 / 7.77 / 5.69 / 6.42 |
| Qwen2-7B-Instruct | Idiom | 9.04 / 9.11 / 7.46 / - | 6.80 / 7.16 / 5.27 / - | 9.33 / 9.31 / 8.50 / - | 8.12 / 8.01 / 7.60 / - | 9.02 / 9.14 / 7.71 / - | 8.58 / 8.30 / 7.91 / - |
| Qwen2-7B-Instruct | Proverb | 6.79 / 6.75 / 6.26 / - | 7.33 / 7.68 / 6.35 / - | 8.90 / 8.82 / 8.06 / - | 7.85 / 7.53 / 7.45 / - | 6.76 / 6.76 / 7.24 / - | 7.70 / 7.68 / 7.51 / - |
| Qwen2-7B-Instruct | Ancient Poetry | 4.84 / 2.10 / 0.79 / - | 7.66 / 6.79 / 0.28 / - | 8.69 / 7.94 / 0.65 / - | 5.34 / 2.75 / 0.97 / - | 4.84 / 2.10 / 1.03 / - | 6.64 / 3.84 / 0.64 / - |
| Qwen2-7B-Instruct | Fact | 4.31 / 4.31 / 4.91 / - | 6.97 / 6.42 / 1.97 / - | 7.73 / 7.33 / 6.48 / - | 4.71 / 4.50 / 5.30 / - | 4.30 / 4.23 / 4.75 / - | 6.57 / 4.86 / 5.35 / - |
| Qwen2-7B-Instruct | Logic | 5.06 / 5.00 / 5.04 / 8.08 | 7.13 / 5.13 / 4.11 / 3.00 | 9.36 / 8.29 / 7.71 / 7.78 | 7.55 / 7.32 / 7.24 / 7.70 | 7.12 / 7.10 / 7.41 / 7.78 | 7.88 / 7.49 / 7.53 / 7.91 |
| DeepSeek-LLM-7B-Chat | Pinyin | 1.00 / 0.72 / 0.16 / 5.76 | 7.47 / 6.57 / 4.54 / 2.78 | 8.02 / 8.04 / 5.61 / 3.62 | 5.30 / 5.01 / 4.32 / 5.35 | 5.12 / 4.94 / 4.14 / 4.95 | 5.20 / 5.27 / 4.55 / 5.21 |
| DeepSeek-LLM-7B-Chat | Classical Chinese | 2.88 / 3.51 / 3.25 / 6.31 | 4.19 / 4.03 / 3.47 / 5.03 | 4.29 / 4.51 / 3.90 / 6.50 | 4.40 / 4.03 / 3.25 / 4.99 | 5.12 / 4.94 / 4.14 / 4.95 | 5.40 / 5.44 / 3.68 / 6.04 |
| DeepSeek-LLM-7B-Chat | Idiom | 8.09 / 8.72 / 6.80 / - | 9.27 / 9.06 / 7.11 / - | 8.88 / 8.73 / 7.56 / - | 8.76 / 7.56 / 7.33 / - | 8.36 / 7.06 / 6.33 / - | 9.03 / 8.94 / 7.97 / - |
| DeepSeek-LLM-7B-Chat | Proverb | 6.79 / 6.89 / 6.91 / - | 8.38 / 8.33 / 7.56 / - | 8.24 / 8.42 / 7.83 / - | 8.37 / 8.36 / 7.35 / - | 6.82 / 7.29 / 7.02 / - | 8.18 / 8.41 / 7.75 / - |
| DeepSeek-LLM-7B-Chat | Ancient Poetry | 2.02 / 1.86 / 0.43 / - | 4.82 / 6.05 / 0.20 / - | 8.77 / 7.48 / 0.34 / - | 3.33 / 3.09 / 0.37 / - | 2.34 / 1.70 / 0.23 / - | 4.07 / 3.07 / 0.52 / - |
| DeepSeek-LLM-7B-Chat | Fact | 2.63 / 1.89 / 3.21 / - | 8.26 / 8.40 / 5.74 / - | 6.43 / 6.54 / 5.99 / - | 4.02 / 4.32 / 3.01 / - | 2.51 / 2.80 / 3.21 / - | 4.20 / 3.37 / 3.96 / - |
| DeepSeek-LLM-7B-Chat | Logic | 4.39 / 4.56 / 4.10 / 7.62 | 7.25 / 6.34 / 5.94 / 6.02 | 8.43 / 7.36 / 7.33 / 7.66 | 6.72 / 6.63 / 6.98 / 5.36 | 6.38 / 6.35 / 7.06 / 7.56 | 7.07 / 6.67 / 7.04 / 7.61 |
| Baichuan2-7B-Chat | Pinyin | 0.32 / 0.07 / 0.04 / 5.30 | 5.42 / 4.14 / 5.24 / 4.07 | 8.69 / 5.69 / 6.57 / 4.03 | 4.92 / 2.61 / 5.43 / 5.09 | 5.20 / 2.84 / 5.02 / 5.39 | 3.33 / 2.35 / 3.25 / 3.27 |
| Baichuan2-7B-Chat | Classical Chinese | 2.76 / 3.03 / 2.85 / 5.68 | 8.34 / 8.13 / 6.41 / 1.95 | 5.65 / 5.78 / 4.06 / 4.20 | 1.74 / 2.64 / 1.90 / 2.78 | 2.55 / 3.14 / 2.91 / 5.82 | 7.44 / 6.81 / 3.57 / 4.71 |
| Baichuan2-7B-Chat | Idiom | 8.16 / 7.98 / 6.74 / - | 7.98 / 8.08 / 6.84 / - | 9.28 / 9.29 / 7.75 / - | 7.60 / 6.24 / 6.60 / - | 8.33 / 7.72 / 6.61 / - | 7.94 / 7.15 / 6.71 / - |
| Baichuan2-7B-Chat | Proverb | 6.87 / 6.46 / 6.57 / - | 7.38 / 6.94 / 6.47 / - | 8.67 / 8.61 / 7.82 / - | 7.54 / 7.74 / 6.71 / - | 6.79 / 6.67 / 6.63 / - | 8.30 / 7.78 / 6.46 / - |
| Baichuan2-7B-Chat | Ancient Poetry | 1.78 / 1.52 / 0.22 / - | 3.39 / 2.95 / 0.51 / - | 7.51 / 7.00 / 0.45 / - | 1.51 / 1.34 / 0.30 / - | 1.61 / 1.40 / 0.19 / - | 2.75 / 1.07 / 0.00 / - |
| Baichuan2-7B-Chat | Fact | 2.25 / 2.86 / 3.28 / - | 6.90 / 7.13 / 4.31 / - | 8.19 / 7.57 / 5.66 / - | 3.77 / 3.10 / 3.27 / - | 2.21 / 2.75 / 3.22 / - | 4.77 / 4.74 / 4.04 / - |
| Baichuan2-7B-Chat | Logic | 4.62 / 4.93 / 5.17 / 7.00 | 5.36 / 5.39 / 6.14 / 6.76 | 6.42 / 5.94 / 6.39 / 6.97 | 5.63 / 5.31 / 4.09 / 6.98 | 4.65 / 4.71 / 5.96 / 6.80 | 5.93 / 5.03 / 5.96 / 7.07 |

Table 1: Results (Edit Success / Generalization / Portability / Locality) of pre-edit and post-edit performance with 5 knowledge editing methods for LLMs. We color the metrics where editing results in negative gains in red. The underlined numbers correspond to the best method for each metric.
Figure 7: An example of the Classical Chinese type.
Figure 8: An example of the Classical Chinese type (translated into English).
Figure 9: Evaluation process of CKnowEdit.
Figure 10: An example of evaluation.
Figure 11: An example of evaluation.
Figure 12: An example of an evaluation prompt.
Figure 13: An example of an evaluation prompt (translated into English).
Figure 14: The results of ROME.
