

Can AI Write Classical Chinese Poetry like Humans? An Empirical Study Inspired by Turing Test

Zekun Deng, Hao Yang and Jun Wang

Department of Information Management, Peking University

Institute of Artificial Intelligence, Peking University

Research Center for Digital Humanities, Peking University
{dzk,yanghao2008,junwang}@pku.edu.cn

arXiv:2401.04952v1 [cs.CL] 10 Jan 2024

Abstract

Some argue that the essence of humanity, such as creativity and sentiment, can never be mimicked by machines. This paper casts doubt on this belief by studying a vital question: Can AI compose poetry as well as humans? To answer the question, we propose ProFTAP, a novel evaluation framework inspired by the Turing test to assess AI’s poetry writing capability. We apply it to current large language models (LLMs) and find that recent LLMs do indeed possess the ability to write classical Chinese poems nearly indistinguishable from those of humans. We also reveal that various open-source LLMs can outperform GPT-4 on this task.

1 Introduction

Today’s world sees a fierce discourse on the potential of artificial intelligence to transcend that of humans (Chui et al., 2016; De Cremer and Kasparov, 2021). While studies show that artificial intelligence (AI) has already outdone humans on certain tasks (Silver et al., 2017; Google, 2023), many hold that the very essence of humanity, such as creativity and sentiment, can never be mimicked by machines (Millet et al., 2023).

Among the areas which have not yet been dominated by AI is poetry. Poetry holds a profound significance in the realm of human art and civilization. Serving as a vessel of creativity, poetry encapsulates complex feelings and ideas in a condensed and evocative form, allowing for a deeper exploration of the human condition.

In response to the recent debate, this paper focuses on a vital question: Can AI compose poetry as well as humans? However, this question is tricky to answer since there has hardly been an acknowledged definition of a good poem or a bad poem. In fact, from Aristotle’s Poetics to Confucius’ sayings in the Analects, the way to judge a poem has remained a subject of dispute throughout history. Thus, using existing methods to assess the good and bad of poetry is unlikely to convince everyone. A more objective, sound and convincing alternative is clearly needed, one that can be both accepted by most people and conducted in practice.

Therefore, in this paper, we propose a novel evaluation framework for evaluating AI’s poetry writing capability. The framework is inspired by the Turing test and relies on distinguishability to measure the poetry composing ability of AI. Taking classical Chinese poetry as an exemplar, and based on our experimental results, we argue that current large language models (LLMs) do indeed possess the ability to write poems nearly indistinguishable from those of humans.

The main contributions of this paper are:

1. We propose ProFTAP, a novel framework for evaluating AI-generated poetry inspired by the Turing test. ProFTAP is more objective and rigorous, yet easier to implement, than prior manual methods.

2. We apply ProFTAP to popular LLMs and reveal their capabilities of classical Chinese poetry generation.

3. We show that a finetuned open-source LLM can outperform GPT-4 on classical Chinese poetry generation and is able to write poems nearly indistinguishable from those authored by ancient Chinese poets.

2 Related Works

Prior works have utilized numerous methods for evaluating AI-generated poetry. Many studies adopt automatic approaches. Ormazabal et al. (2022) used the perplexity of a language model. Liu et al. (2019) used BLEU to evaluate model fitness. They also used the Rhetoric F1 score to measure rhetorical accuracy and Distinct-1/2 (Li et al., 2016) to evaluate diversity. Deng et al. (2020) used
the similarity of sentence-level embeddings of sentences within a poem to reflect coherency.

More studies prefer human evaluation. Zhang and Lapata (2014) proposed asking experts to rate poems on a 1–5 Likert scale along four dimensions: fluency, coherence, meaning and poeticness. Numerous later works followed this approach (Chen et al., 2019; Van de Cruys, 2020; Deng et al., 2020). Yi et al. (2018) changed the poeticness dimension into aesthetics and added a new dimension called topic relevance, an approach followed by Liu et al. (2019). Bena and Kalita (2019) used a similar scale to assess creativity in machine-generated poems, but with different dimensions. Ormazabal et al. (2022) asked humans to rank poems written by computers, laymen and experts. Wang et al. (2016) adopted the Feigenbaum test to evaluate machine-generated poems.

3 ProFTAP: An Evaluation Framework for AI-generated Poetry

Whether a poem is good or not is a question without a definitive answer. Despite numerous proposed criteria, few have proven entirely satisfactory. Therefore, instead of seeking a universally recognized standard of a good poem, we resort to a simple but much less controversial criterion: distinguishability. Rather than determining whether AI poems are as good as human ones, we shift focus to investigating whether AIs can generate poems akin to those of humans. Consequently, we can measure the poetry writing ability of AI to some extent without having to decide how meritorious a poem is.

Therefore, we propose a novel evaluation framework, which we name Probabilistic Feigenbaum Test for AI-generated Poetry (ProFTAP). The Feigenbaum test (Feigenbaum, 2003) is a variation of the Turing test that can be used where a computer tries to replicate a domain expert in a specific field. The core of our framework is based on the Feigenbaum test. The procedures of ProFTAP and its differences from previous methods are stated in the subsections below.

3.1 Procedures

The procedures for applying ProFTAP to evaluate the poetry generation capability of AI models $\mathcal{M} = \{m_1, \dots, m_M\}$ are as follows.

Step 1: Obtaining titles as conditions. AIs are not expected to freely generate poems on any topic, since this may lead to unfair comparison. Therefore, we obtain a set of poems authored by real humans and later require the AIs to generate poems on exactly the titles that we provide, and on no others. Formally, a poem $p_{i,j}$ consists of a title $t_j$ and the main text $c_{i,j}$ ($j = 1, \dots, T$). The set of poems by real humans is denoted $\mathcal{P}^{*} = \{p_{*,1}, \dots, p_{*,T}\}$. We collect the titles of the poems $\mathcal{T} = \{t_1, \dots, t_T\}$.

Step 2: Preparing AI models. AIs need to be prepared to generate the desired form of output. For LLMs, as an example, preparation may include designing appropriate prompts.

Step 3: Generation. We use each model $m_i \in \mathcal{M}$ to generate the main text of a poem conditioned on title $t_j \in \mathcal{T}$. The resulting poem is denoted $p_{i,j}$. The set of all generated poems is denoted $\mathcal{P} = \{p_{1,1}, \dots, p_{M,T}\}$.

Step 4: Post-processing and Anti-plagiarism. Post-processing is conducted in case an AI does not fully comply with the requirements. For instance, words explaining the poem are removed by rule-based programs. The generated poems are also matched against an ancient poem database to find plagiarism. Any poems with duplication are re-generated until no overlap can be found. The poems after this step are still denoted $\mathcal{P}$.

Step 5: Human Judgement. A group of human judges is recruited. The judges are instructed to assign to each poem they are given a probability (0.0 to 1.0) indicating how likely it is that the poem was written by a real human, without referring to any other information and without the help of other people or computers. We mix real-human poems and AI-generated ones together, i.e. $\mathcal{U} = \mathcal{P}^{*} \cup \mathcal{P}$. We randomly shuffle $\mathcal{U}$ and distribute the poems randomly to human judges. Each poem is assigned to at least $K$ different human judges. The mean probability value assigned by different judges to a poem $p_{i,j}$ ($i = *, 1, 2, \dots, M$) is denoted $q_{i,j}$.

Step 6: Obtaining metric. We derive the Receiver Operating Characteristic (ROC) curve of human judges with respect to distinguishing real-human poems from poems by each AI. We calculate the area under the curve (AUC) corresponding to each model. The closer the AUC is to 0.5, the more similar the AI’s poems are to human ones. Additionally, we apply the Wilcoxon signed-rank test (Wilcoxon, 1992) to $q_{i,j}$ and $q_{*,j}$ for each model $m_i \in \mathcal{M}$ to evaluate whether the probability difference between human- and AI-authored poems conditioned on the same title is statistically significant.
3.2 Difference from Previous Evaluation Methods

Our framework has advantages over previous methods in three aspects. Firstly, ProFTAP is more objective and rigorous. Prior human evaluations usually require judgements of subjective concepts, such as poeticness, aesthetics and coherence, and different people interpret these concepts differently. Secondly, ProFTAP requires relatively less human effort. Compared with scoring four to five dimensions to assess a single poem, our framework only needs one. Thirdly, ProFTAP is much more reliable and interpretable than automatic approaches such as BLEU or perplexity. The involvement of human effort increases the soundness of evaluation.

Although Wang et al. (2016) have used the Feigenbaum test in evaluating machine poetry, our framework differs from theirs significantly in two ways. First, ProFTAP is based on probability estimation of authorship rather than a yes/no answer, enabling more informative and detailed statistical analysis. Furthermore, our framework does not necessarily require humans to compose new poems and can simply use existing ones in a database. In the case of classical Chinese poetry, this brings an extra benefit because classical poems composed nowadays are hardly on par with ancient ones.

4 Experiments and Results

4.1 Human Judgements of Current LLMs

We employ ProFTAP to evaluate the classical Chinese poetry writing ability of major current LLMs.

We recruited 13 human judges to conduct the evaluation. Most of them have a higher education background relevant to classical Chinese poetics. We randomly chose 110 poems ($T = 110$) from Souyun¹, the largest public database of classical Chinese poetry to date. Since the database contains approximately 1 million poems, it is impossible for any ordinary individual to memorize even a small portion of them. Thus, we can rule out the possibility that human judges might know by memory that a randomly sampled poem is authored by a human.

We selected 10 AI models for assessment. For proprietary LLMs, we chose GPT-3.5, GPT-4 (OpenAI, 2023) and Ernie-Bot-4.0², which are often considered the best-performing models on many tasks in the Chinese language. For open-source LLMs, we chose Qwen-72B-Chat (Bai et al., 2023), Yi-34B-Chat³, Qwen-14B-Chat (Bai et al., 2023) and ChatGLM3-6B (Du et al., 2022), covering various scales. We also evaluate Xunzi-Qwen-Chat⁴, an LLM continually pretrained on ancient Chinese corpora based on Qwen-7B. We use the same prompt and hyperparameters for all models. We also tested Jiuge (Zhipeng et al., 2019), a reputable AI system specialized in this task. See Appendix A for more details on our experiments.

In addition to out-of-the-box LLMs, we also train a new LLM specialized in this task and include it in our evaluation. This model, which we name Qwen-72B-Poet, is finetuned from Qwen-72B-Chat-Int4 using Souyun data. See Appendix B for training details.

4.2 Results

The AUC and Wilcoxon test results are shown in Table 1. Among all the models, Qwen-72B-Poet has the lowest AUC (0.541), which indicates that its poems are most similar to human ones. Xunzi-7B-Chat is the second best LLM and beats Qwen-72B-Chat, which shows the effectiveness of continual pretraining. GPT-4 has a much lower AUC than GPT-3.5, and Ernie-Bot-4.0 underperforms GPT-4. Although Jiuge is not based on an LLM, it still outperforms all prior LLMs we tested. Despite its larger scale, Yi-34B-Chat underperforms Qwen-14B-Chat.

The p-value of Qwen-72B-Poet is 0.058, which means that there is no statistically significant difference between the poems by Qwen-72B-Poet and humans at the conventional 0.05 level. In contrast, the p-values of all other models are no greater than 3e-4, indicating statistically significant differences. The results show that Qwen-72B-Poet can actually write classical Chinese poetry nearly indistinguishable from that of humans, at least to some extent.

Model           #Prm  AUC    W.T. p
Qwen-72B-Poet   72B   0.541  0.058
Jiuge           N/A   0.632  3e-4
Xunzi-7B-Chat   7B    0.633  8e-5
Qwen-72B-Chat   72B   0.641  7e-5
GPT-4           N/A   0.670  2e-6
Ernie-Bot-4.0   N/A   0.732  <1e-8
Qwen-14B-Chat   14B   0.784  <1e-8
GPT-3.5         N/A   0.874  <1e-8
Yi-34B-Chat     34B   0.889  <1e-8
ChatGLM3-6B     6B    0.913  <1e-8

Table 1: AUC and p-value of Wilcoxon test (W.T. p) of each model. #Prm stands for number of parameters.

5 Discussion

We further investigate how explicit features of AI-generated poems might impact how they are judged. We filter the generated poems according to 2 crucial factors: line length and character repetition⁵. Since human-written poems usually have patterned line lengths while certain AI models often disregard the convention, line length might play an important role in the evaluation of AI poetry. The same holds for character repetition. We calculate the AUC of each model with poems that violate either criterion removed (see Table 2). Clearly, removing AI poems with atypical line lengths generally makes them more similar to human poems, but character repetition does not matter as much.

Model           Lin. Len.  Cha. Rep.
Qwen-72B-Poet   0.525 ↓    0.563 ↑
Jiuge           0.646 ↑    0.698 ↑
Xunzi-7B-Chat   0.578 ↓    0.599 ↓
Qwen-72B-Chat   0.571 ↓    0.610 ↓
GPT-4           0.623 ↓    -
Ernie-Bot-4.0   0.741 ↑    -
Qwen-14B-Chat   0.717 ↓    -
GPT-3.5         0.809 ↓    -
Yi-34B-Chat     0.699 ↓    0.844 ↓
ChatGLM3-6B     0.822 ↓    -

Table 2: AUC after removing poems violating each criterion (Lin. Len.: line length; Cha. Rep.: character repetition). "↑" indicates higher AUC than original and "↓" lower. "-" indicates sample size less than 10.

Since most classical Chinese poems have a line length of 5 or 7 characters (i.e. 5-yan / 7-yan), we separate the poems into three categories: 5-yan, 7-yan and others. We calculate the ratio of generated poems⁶ in each category (Figure 1) and the AUC of 5- and 7-yan poems with respect to human poems with the same line length (Table 3). We find that there is not much difference between the AUC of 5- and 7-yan poems, but some models do have a tendency to generate more 7-yan than 5-yan poems.

Figure 1: Ratio of poems authored by each model and human in each category of yan.

Model           5-yan  7-yan
Qwen-72B-Poet   0.499  0.538
Jiuge           0.624  0.668
Xunzi-7B-Chat   0.520  0.601
Qwen-72B-Chat   0.578  0.571
GPT-4           0.538  0.663
Ernie-Bot-4.0   -      0.746
Qwen-14B-Chat   0.725  0.716
GPT-3.5         0.851  0.777
Yi-34B-Chat     -      0.715
ChatGLM3-6B     -      0.764

Table 3: AUC of each category of line length. "-" indicates sample size less than 10.

It is also worth noting that, although our results show that most LLMs cannot write classical Chinese poems like humans using the current prompt, it is possible that these LLMs can do better if more advanced prompting techniques are used. We leave this for future work.

6 Conclusion

We propose ProFTAP, a new framework for evaluating AI-generated poetry, which is more objective and rigorous than previous methods. We apply ProFTAP to popular LLMs and reveal that current AI models still have room for improvement in writing classical Chinese poems. However, we do find that a finetuned open-source LLM can generate classical Chinese poems nearly indistinguishable from those of ancient Chinese poets. We expect our findings to be beneficial for future research on AI poetry generation and evaluation.

¹ https://sou-yun.cn
² See more information on https://cloud.baidu.com.
³ https://huggingface.co/01-ai/Yi-34B-Chat
⁴ https://modelscope.cn/models/Xunzillm4cc/Xunzi-Qwen-Chat
⁵ None of the criteria are strict in classical Chinese poetry, and many highly-appreciated poems actually break these rules. However, people do often use these criteria for judgement.
⁶ Jiuge requires yan to be designated before generation, so we randomly choose between 5 and 7 with equal probability each time. However, the system occasionally returns poems that are neither 5-yan nor 7-yan even when it is designated to do so.
References

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.

Brendan Bena and Jugal Kalita. 2019. Introducing aspects of creativity in automatic poetry generation. In Proceedings of the 16th International Conference on Natural Language Processing, pages 26–35, International Institute of Information Technology, Hyderabad, India. NLP Association of India.

Huimin Chen, Xiaoyuan Yi, Maosong Sun, Wenhao Li, Cheng Yang, and Zhipeng Guo. 2019. Sentiment-controllable Chinese poetry generation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 4925–4931. International Joint Conferences on Artificial Intelligence Organization.

Michael Chui, James Manyika, and Mehdi Miremadi. 2016. Where machines could replace humans-and where they can’t (yet). The McKinsey Quarterly, pages 1–12.

David De Cremer and Garry Kasparov. 2021. AI should augment human intelligence, not replace it. Harvard Business Review, 18:1.

Liming Deng, Jie Wang, Hangming Liang, Hui Chen, Zhiqiang Xie, Bojin Zhuang, Shaojun Wang, and Jing Xiao. 2020. An iterative polishing framework based on quality aware masked language model for Chinese poetry generation. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7643–7650.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314.

Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022. GLM: General language model pretraining with autoregressive blank infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335, Dublin, Ireland. Association for Computational Linguistics.

Edward A Feigenbaum. 2003. Some challenges and grand challenges for computational intelligence. Journal of the ACM (JACM), 50(1):32–40.

Gemini Team Google. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.

Zhiqiang Liu, Zuohui Fu, Jie Cao, Gerard de Melo, Yik-Cheung Tam, Cheng Niu, and Jie Zhou. 2019. Rhetorically controlled encoder-decoder for Modern Chinese poetry generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1992–2001, Florence, Italy. Association for Computational Linguistics.

Kobe Millet, Florian Buehler, Guanzhong Du, and Michail D. Kokkoris. 2023. Defending humankind: Anthropocentric bias in the appreciation of AI art. Computers in Human Behavior, 143:107707.

OpenAI. 2023. GPT-4 technical report.

Aitor Ormazabal, Mikel Artetxe, Manex Agirrezabal, Aitor Soroa, and Eneko Agirre. 2022. PoeLM: A meter- and rhyme-controllable language model for unsupervised poetry generation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 3655–3670, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. 2017. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Tim Van de Cruys. 2020. Automatic poetry generation from prosaic text. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2471–2480, Online. Association for Computational Linguistics.

Qixin Wang, Tianyi Luo, and Dong Wang. 2016. Can machine generate traditional Chinese poetry? A Feigenbaum test. In Advances in Brain Inspired Cognitive Systems, pages 34–46, Cham. Springer International Publishing.

Frank Wilcoxon. 1992. Individual comparisons by ranking methods. In Breakthroughs in Statistics: Methodology and Distribution, pages 196–202. Springer.

Xiaoyuan Yi, Maosong Sun, Ruoyu Li, and Zonghan Yang. 2018. Chinese poetry generation with a working memory model. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4553–4559. International Joint Conferences on Artificial Intelligence Organization.

Xingxing Zhang and Mirella Lapata. 2014. Chinese poetry generation with recurrent neural networks. In Proceedings of the 2014 Conference on Empirical
Methods in Natural Language Processing (EMNLP), pages 670–680, Doha, Qatar. Association for Computational Linguistics.

Guo Zhipeng, Xiaoyuan Yi, Maosong Sun, Wenhao Li, Cheng Yang, Jiannan Liang, Huimin Chen, Yuhui Zhang, and Ruoyu Li. 2019. Jiuge: A human-machine collaborative Chinese classical poetry generation system. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 25–30, Florence, Italy. Association for Computational Linguistics.

A Experimental Details

A.1 Prompts and Hyperparameters

For all LLMs we assessed (except Qwen-72B-Poet), we use the same prompt as below:

想象你是一位著名诗人,请你写一首题为《{{title}}》的古诗。要让别人以为你的诗是真人所作,不要让人看出是机器生成的。

(English translation: Imagine you are a famous poet. Please write a classical poem titled {{title}}. Lead people to believe that your poem is written by a real human and do not let them realize it is generated by a machine.)

We attempted to use this prompt with LLaMA 2 (Touvron et al., 2023) chat models of all sizes, but they refused to respond due to their safety restrictions, so we were not able to test them.

We use a temperature of 0.9 during generation. We keep other hyperparameters, such as top-p, at the default of each model. We do not apply any other modifications.

A.2 Usage of Commercial APIs

We use APIs to access proprietary LLMs. The exact model version of GPT-4 we used is gpt-4-1106-preview (also known as GPT-4 Turbo). The version of GPT-3.5 is gpt-3.5-turbo-1106. We accessed the Ernie-Bot-4.0 API in November 2023.

A.3 Experimental Process

We use the Souyun database for anti-plagiarism. We consider a poem duplicative if at least two consecutive lines match lines from the database exactly. We convert all text into Simplified Chinese. The minimum number of human judges by whom each poem is assessed is 2 ($K = 2$).

B Training of Qwen-72B-Poet

We filtered poems from Souyun according to 2 criteria:

1. The poem was written in the Tang (唐) dynasty.

2. The form of the poem is either gufeng (古风), lushi (律诗), jueju (绝句), siyan (四言), liuyan (六言) or pailu (排律).

We sample 96% of them as training data, yielding 40,026 training samples.

We convert poems to ChatML format as follows.

<|im_start|>system\n
You are a helpful assistant.<|im_end|>\n
<|im_start|>user\n
写一首题为《{{title}}》的{{form}}。<|im_end|>\n
<|im_start|>assistant\n
{{content}}<|im_end|>\n

(English translation of the user turn: Write a {{form}} titled {{title}}.)

Here, {{form}} is the form of the poem, which can be gufeng (古风), lushi (律诗) or jueju (绝句). We finetune from the Qwen-72B-Chat-Int4 model. We adopt 4-bit QLoRA (Dettmers et al., 2023) to improve efficiency. The QLoRA rank is 64 and alpha is 16. We set the batch size to 1 and gradient accumulation steps to 8. We set the max sequence length to 512. The learning rate is 1e-4 with 256 warm-up steps. We train the model for 1.5 epochs. The training took 70 hours on one Nvidia A100 GPU.

The prompt during evaluation is identical to the one used in training. During evaluation, we randomly choose one form out of the three with equal probability for each prompt.

Notably, we find that when the model is instructed to write a poem on a title and form that are part of the training data, it tends to recite the original poem in the training set. We intend to address this issue in the future.

C Generation Examples

Below are 2 poems generated by Qwen-72B-Poet.

(1) 夜雪 (Night Snow)

已觉衾枕冷,转见窗户明。积雪阴云尽,寒飙曙色清。林疏犹现月,岩深可听声。唯应御楼望,遥辽近禁城。

Already feeling the bed and pillow cold, then seeing the bright light outside the window. The accumulated snow disperses the dark clouds, and the cold wind gradually makes the dawn clear. The sparse
trees still reveal the moonlight, and faint sounds come from deep within the mountain rocks. Only by gazing from the imperial tower can one approach the distant Forbidden City.

Analysis: The poem adopts a white sketch technique to depict the snowy scenery, incorporating the "cold," "brightness," and "stillness" of the night snow, creating a distinctive style. Although the poem describes the scenery, it focuses on the word "night," portraying the sights and feelings of a snowy night, blending emotion into the scenery. The opening two lines borrow from Bai Juyi (白居易)’s phrases "已讶衾枕冷,复见窗户明", yet refine the wording with innovative expression.

In the opening line, "Already feeling the bed and pillow cold," the poet describes the initial fall of night snow, where the poet has already sensed the chill in the bed and pillow, indicating that the snow has been falling for quite some time. The poem uses the perspective of human sensations, making the "night snow" tangible.

Continuing with "seeing the bright light," the poet looks out from the window, noticing that the snowfall shows no signs of diminishing, causing the poet to feel that the window has become brighter, while moonlight and lamplight cannot achieve the same effect. The use of the word "转" (turn) is clever, revealing the entire process of the poet’s perception of the snowy scenery from deep night to daybreak, emphasizing the poet’s sensitivity to the snow.

Subsequently, the poet sees through the window, "The accumulated snow disperses the dark clouds, and the cold wind gradually makes the dawn clear." The snowfall is significant, lasting throughout the night. By daybreak, the snow stops, the sun emerges, the accumulated snow and dark clouds disappear, and the cold northern wind subsides, leaving a clear dawn illuminating the earth.

"Trees sparse, still showing the moon; rocks deep, able to hear sounds." Stepping outside, the poet observes sparse trees with the moon still shining, and deep rocks where one can hear the sound of the wind. The poet not only portrays the night scene with clear moonlight but also conveys the serene night with audible stillness in the deep rocks.

In the final line, "Only from the imperial tower can one see, distant and close to the Forbidden City," it indicates that one can only appreciate the distant snowy scenery near the Forbidden City by ascending the imperial tower.

The poem depicts both distant and near views, still and dynamic scenes, brightness and darkness, snow and moon scenery, night and dawn scenes. It portrays the night snow not directly but indirectly, focusing on human sensations with "bed and pillow cold" and "window bright," providing a vivid impression. In terms of structure, the poet skillfully uses the technique of side description, with "The accumulated snow disperses the dark clouds, and the cold wind gradually makes the dawn clear" corresponding to "Already feeling the bed and pillow cold, seeing the window bright," creating a cohesive and well-structured poem. Additionally, the poem employs metaphor and contrast to make the night snow more tangible, contrasting "The cold wind gradually makes the dawn clear" with "The accumulated snow disperses the dark clouds."

Therefore, this poem demonstrates skillful technique, well-organized logic, and clear and understandable content. It reads as if it were crafted by a real person, making it difficult to discern that it was written by a machine solely based on its content.

(2) 游三清山 (Roaming the Three Pure Peaks)

昔闻三清山,今始到其境。石壁屹当路,万古雪霜冷。山灵有神功,斧凿不可整。巨灵擘两掌,裂破混沌景。我欲呼神工,移置华山顶。

Long had I heard of the Three Pure Peaks, today I stand in their midst. Cliffs rise imposingly along the path, bearing witness to millennia of frost and cold. The spirits of this mountain wield mystical forces, impervious to the chiseling of axes. The giant spirit stretches forth its palms, tearing apart the chaotic scenes. I wish to summon that miraculous artisan, to relocate the Three Pure Peaks to the summit of Mount Hua.

Analysis: The poem depicts the magnificent scenery of the Three Pure Peaks, expressing the author’s sincere admiration for nature, in harmony with the Taoist reverence for the natural order symbolized by the Three Pure Peaks. The language of the entire poem is simple, the imagery is grand, and the style is elevated.

"Long had I heard of the Three Pure Peaks, today I stand in their midst." The poem begins with a discourse, and the phrase "Long had I heard" sets a lofty and powerful tone for the entire poem. The three words "Three Pure Peaks" not only clarify the title but also lead into the subsequent verses. The phrase "today I stand in their midst" informs us of the time the poet visited the Three Pure Peaks, indicating a transition from hearing about them to actually experiencing them. The word "today" suggests the poet’s journey was not an easy one, laying the groundwork for the subsequent admiration of the spectacular scenery.

The poet then proceeds to describe the scenery of the Three Pure Peaks in two poetic lines. "Cliffs rise imposingly along the path, bearing witness to millennia of frost and cold." The most striking aspect of the Three Pure Peaks is their peculiar peaks and rocks, and the exaggerated phrase "millennia of frost and cold" vividly conveys the perennial snow and the naturally severe environment. These two lines succinctly and vividly depict the scenery, striking a balance between conciseness and vividness.

Using an exaggerated style, the author then portrays the extraordinary features of the Three Pure Peaks. "The mountain spirits possess mystical powers, impervious to the chiseling of axes." This vividly conveys that the Three Pure Peaks are the result of supernatural craftsmanship, steep and rugged without any human intervention, naturally formed. "The giant spirit stretches forth its palms, tearing apart the chaotic scenes," refers to the mythical giant spirit who split open the scenes of chaos on Mount Hua. The poem, by drawing on mythology, further emphasizes the ruggedness and towering nature of the Three Pure Peaks.

The final two lines, "I wish to summon that miraculous artisan, to relocate the Three Pure Peaks to the summit of Mount Hua," echo the opening lines. The poet, witnessing the divine craftsmanship of the Three Pure Peaks, desires to call upon the artisan to move them to the summit of Mount Hua, enhancing the beauty of Mount Hua and expressing the author’s deep affection for the Three Pure Peaks.

The poem combines reality and imagination, with the first half describing the scenery of the Three Pure Peaks and the second half portraying the poet’s imaginative scene. The poem seamlessly transitions between the two, creating a cohesive whole. Beginning with a discourse, then depicting the peculiar peaks and rocks, and finally weaving in mythical legends, the poem paints a picture of the Three Pure Peaks as lofty, perilous, with peculiar peaks and rocks, evoking a sense of awe and fascination.

This poem bears the title of "Three Pure Peaks," yet it does not focus on describing the scenery of the Three Pure Peaks. Instead, it approaches the subject from the perspective of the immortals symbolized by the Three Pure Ones, explaining that the scenery of the Three Pure Peaks is naturally formed, not the result of human intervention. Although it appears to be a travel poem, it completely avoids cliches, giving it an authentic quality that makes it difficult to distinguish, in terms of technique, whether it was written by a machine or a human.
