
ChatGPT is a Potential Zero-Shot Dependency Parser

Boda Lin¹‡, Xinyi Zhou²‡, Binghao Tang¹, Xiaocheng Gong¹, Si Li¹∗

¹ School of Artificial Intelligence, Beijing University of Posts and Telecommunications
² Department of Chinese Language and Literature, East China Normal University
{linboda, lisi}@bupt.edu.cn

arXiv:2310.16654v1 [cs.CL] 25 Oct 2023

Abstract

Pre-trained language models have been widely used in the dependency parsing task and have achieved significant improvements in parser performance. However, it remains an understudied question whether pre-trained language models can spontaneously exhibit the ability of dependency parsing without introducing additional parser structure in the zero-shot scenario. In this paper, we propose to explore the dependency parsing ability of large language models such as ChatGPT and conduct linguistic analysis. The experimental results demonstrate that ChatGPT is a potential zero-shot dependency parser, and the linguistic analysis also shows some unique preferences in parsing outputs.

1 Introduction

Dependency parsing is a fundamental task in Natural Language Processing and has many applications in downstream tasks, such as machine translation (Bugliarello and Okazaki, 2020), question answering (Teney et al., 2017), and information retrieval (Chandurkar and Bansal, 2017). Previous research has mainly focused on how to design the parser structure and parsing algorithms to achieve better performance in different scenarios (Dozat and Manning, 2017; Ma et al., 2018; Li et al., 2019).

The linguistic basis of dependency parsing is dependency grammar (Jarvinen and Tapanainen, 1998), which comes from linguists' research on linguistic rules and phenomena. Notably, Pre-trained Language Models (PLMs) can also be viewed as "linguists" that automatically learn rules from a vast amount of natural language text. Therefore, investigating whether these PLMs spontaneously learn certain syntactic rules during the pre-training stage is a valuable research topic.

After the proposal of the BERT model (Devlin et al., 2019), numerous probing studies explored the specific functions learned by each layer of BERT, which also touches upon the research on self-acquisition of syntax by PLMs (Rogers et al., 2020). However, BERT-based probing presents the following limitations: 1) These studies often induce dependency parsing results from the attention mechanism of BERT, so the parsing results do not stem from natural generative steps; in particular, parameterized probing methods can introduce external information interference (Wu et al., 2020). 2) Due to model limitations, these works are often confined to relatively simple datasets and struggle to directly yield syntactic results with dependency relation labels. Although the subsequent BART (Lewis et al., 2020), T5 (Raffel et al., 2020), and other encoder-decoder PLMs show good performance in language generation, the complexity of the expression form inherent in the dependency parsing task still makes it challenging to induce syntactic results from such PLMs in a more straightforward manner.

Recently, Large Language Models (LLMs) such as InstructGPT (Ouyang et al., 2022) and ChatGPT¹, which possess superior generative capabilities, have been introduced in NLP. These models have achieved impressive performance on various NLP tasks, including question answering, reading comprehension, and summarization (Ouyang et al., 2022), even in a zero-shot fashion, providing a crucial key to investigating the innate syntactic abilities of language models. We are interested in the following questions: 1) Do LLMs like ChatGPT possess zero-shot dependency parsing capabilities? 2) If so, do the outputs of ChatGPT still maintain a similar structure for similar sentences, even between different languages? 3) Furthermore, do these parsing results contain some preferences that can be summarized?

In this paper, we explore using ChatGPT and other LLMs to achieve zero-shot dependency parsing, and conduct linguistic analysis on these parsing results to answer these questions.

‡ Boda Lin and Xinyi Zhou contributed equally.
∗ Corresponding author.
¹ https://openai.com/blog/chatgpt
[Figure 1: The total framework of our work. A prompt ("please using standard CoNLL formation (The outputs are 10 column results, the 1th column is the word ID, the 2th colum is word, the 7th column is the head, and the 8th column is the relation, th 9th is the '_' and the 10th is the '_', all columns are separated by standard tabs) with Penn Treebank relations do dependency parsing for this sentence 'He has denied this .'") elicits a CoNLL parse from ChatGPT (here: "He"→2 nsubj, "has"→0 root, "denied"→2 xcomp, "this"→3 obj, "."→2 punct), which is then post-processed and analyzed against the gold parse ("He"→3 nsubj, "has"→3 aux, "denied"→0 root, "this"→3 obj, "."→3 punct).]
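To make the pipeline in Figure 1 concrete, here is a minimal Python sketch of how such a prompt can be issued and the 10-column CoNLL reply read back. The prompt wording follows Figure 1 (lightly normalized); the OpenAI client call (shown commented out) and the helper name parse_conll_reply are illustrative assumptions, not the authors' released code.

```python
# Prompt wording adapted from Figure 1; anything beyond this template is an assumption.
PROMPT_TEMPLATE = (
    "please using standard CoNLL format (The outputs are 10 column results, "
    "the 1th column is the word ID, the 2th column is word, the 7th column is "
    "the head, and the 8th column is the relation, the 9th is the '_' and the "
    "10th is the '_', all columns are separated by standard tabs) with Penn "
    "Treebank relations do dependency parsing for this sentence \"{sentence}\""
)

def parse_conll_reply(reply: str):
    """Extract (id, word, head, relation) tuples from a 10-column CoNLL reply.

    Lines without exactly 10 tab-separated fields are skipped here; the
    stricter filtering actually applied is described in Appendix A.
    """
    rows = []
    for line in reply.strip().splitlines():
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # format disruption; left to post-processing
        idx, word, head, rel = cols[0], cols[1], cols[6], cols[7]
        if idx.isdigit() and head.isdigit():
            rows.append((int(idx), word, int(head), rel))
    return rows

# Hypothetical call with the OpenAI Python client (temperature 0, as in Section 5.1):
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     temperature=0,
#     messages=[{"role": "user",
#                "content": PROMPT_TEMPLATE.format(sentence="He has denied this .")}],
# ).choices[0].message.content
# print(parse_conll_reply(reply))
```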

The results demonstrate that ChatGPT is a potential zero-shot dependency parser and that its outputs maintain a similar structure across different languages. We also summarize some parsing preferences of ChatGPT through linguistic analysis. The most surprising finding is that in some cases, ChatGPT outputs are more in line with linguistic rules than the gold annotations.

2 Related Work

2.1 Dependency Parsing

PLMs are widely used in previous dependency parsing research. Biaffine (Dozat and Manning, 2017) with BERT (Devlin et al., 2019) is a very simple and effective parser. Yang and Tu (2022) propose a graph-based parser based on headed spans. Lin et al. (2022) design a serialization of parsing trees, enabling the T5 model to directly generate the parsing sequence. Besides, the newest state-of-the-art model, Hexatagging (Amini et al., 2023), casts the parsing tree into a hexatag sequence and also only uses BERT to achieve parsing. But these methods still rely on the supervised learning paradigm.

2.2 Probing

Probing research on the BERT model can be roughly divided into two categories: parametric methods and non-parametric methods. Jawahar et al. (2019) use a series of probing tasks to indicate that while BERT does capture some syntactic information, it is not always explicitly encoded within BERT's representations. Wu et al. (2020) use a non-parametric method to achieve the probing task for BERT.

3 ChatGPT Parsing

As shown in Figure 1, our approach leverages the prompt to enable ChatGPT to perform zero-shot dependency parsing. Specifically, we define the input format as a sentence adhering to the original word segmentation of the parsing corpus, while the output is generated in CoNLL format.

The parsing results from ChatGPT may exhibit various format-related issues, including word missing, format disruption, word segmentation, word scrambling, and multiple outputs. We use post-processing to filter the illegal outputs; the specific statistical details are given in Appendix A.

4 The Consistency

In order to explore whether the dependency parsing outputs of ChatGPT maintain consistency across different languages, we conduct corresponding experiments and use the Dependency Tree Edit Distance (DTED) (McCaffery and Nederhof, 2016) to measure the similarity between dependency syntax trees in different languages:

    DTED(Ta, Tb) = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|)    (1)

where Ta and Tb denote two parsing trees from different languages, and the DTED score lies in [0, 1]. EditDist is calculated with the Tree Edit Distance algorithm (Zhang and Shasha, 1989).
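As a concrete illustration of Equation (1), the sketch below computes DTED between two dependency trees using the zss package, an implementation of the Zhang and Shasha (1989) tree edit distance. Representing each node by its relation label is our assumption; the paper does not specify the node labeling it uses.

```python
from zss import Node, simple_distance  # Zhang and Shasha (1989) algorithm

def build_tree(rows):
    """Build a zss tree from (id, word, head, relation) rows; head 0 marks the root."""
    nodes = {i: Node(rel) for i, _word, _head, rel in rows}  # label choice is our assumption
    root = None
    for i, _word, head, _rel in rows:
        if head == 0:
            root = nodes[i]
        else:
            nodes[head].addkid(nodes[i])
    return root

def tree_size(node):
    """Number of nodes |T| in a tree."""
    return 1 + sum(tree_size(child) for child in node.children)

def dted(rows_a, rows_b):
    """DTED(Ta, Tb) = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|), Equation (1)."""
    tree_a, tree_b = build_tree(rows_a), build_tree(rows_b)
    dist = simple_distance(tree_a, tree_b)
    return 1 - dist / max(tree_size(tree_a), tree_size(tree_b))
```

Any consistent node labeling (words, relations, or part-of-speech tags) could be substituted; only the tree sizes and the edit distance enter Equation (1).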
5 Parsing Ability Experiment

5.1 Setting

In this paper, we investigate the parsing ability of ChatGPT in two different settings: 1) The normal zero-shot parsing setting. In this setting, we use gpt-3.5-turbo, HuggingChat², Vicuna-13B³ and ChatGLM-6B (Zeng et al., 2022) to directly perform dependency parsing on English and Chinese. 2) The cross-lingual parsing setting. In this setting, we collect sentence pairs from English and Chinese which have a similar parsing structure. Then we conduct dependency parsing on these sentence pairs and analyze the similarities and differences between English and Chinese. In order to avoid the influence of generation randomness, we set the temperature of ChatGPT to 0 in all experiments.

² https://huggingface.co/chat
³ https://lmsys.org/blog/2023-03-30-vicuna

5.2 Dataset

For English parsing, we choose the classic benchmark Penn Treebank (PTB) (Marcus et al., 1993), Chinese Treebank 5 (CTB5) (Xue et al., 2005) and 12 languages from Universal Dependencies Version 2.2 (UD2.2) (Nivre et al., 2016), following the previous work (Ma et al., 2018). For PTB and CTB5, we follow Ma et al. (2018) and use the Stanford basic Dependencies representation (de Marneffe et al., 2006) of PTB and CTB converted by the Stanford parser⁴.

⁴ http://nlp.stanford.edu/software/lex-parser.html

For the cross-lingual parsing setting, we use the DTED score to choose the top-50 most similar sentences from the en-ewt-test and zh-cfl-test of UD2.2.

5.3 Parsing Ability

According to the results shown in Table 1 and Table 3, we can answer the first question in the introduction. There is no doubt that ChatGPT has the ability of zero-shot dependency parsing. In fact, this capability is already quite rare among other LLMs. We also conducted zero-shot experiments on some other popular LLMs (HuggingChat, Vicuna-13B, ChatGLM-6B)⁵ and found that maintaining a generally correct CoNLL-format output is a very challenging task for these LLMs. Furthermore, we conducted one-shot demonstration learning experiments for these LLMs, with only Vicuna able to learn a fairly close format from the example. To rule out the complexity of the CoNLL format itself, we converted the CoNLL-format parsing tree into a sequence format following Lin et al. (2022), but the other LLMs were still unable to produce satisfactory results.

⁵ Since the other LLMs lack the CoNLL or sequence parsing output capability, we cannot calculate their performance.

The results in Table 2 answer the second question in the introduction: the output similarity of ChatGPT between English and Chinese reaches 0.46, about 72% of the gold similarity, which shows that for sentences with similar structures in different languages, the outputs given by ChatGPT still have high structural similarity. We show more details in Appendix B.

Dataset  Method       UAS    LAS
PTB      Biaffine     95.74  94.08
PTB      StackPTR     95.87  94.19
PTB      DPSG         96.64  95.82
PTB      Hexatagging  97.40  96.40
PTB      ChatGPT      40.22  28.61
CTB5     Biaffine     89.30  88.23
CTB5     StackPTR     90.59  89.29
CTB5     Hexatagging  93.20  91.90
CTB5     ChatGPT      28.08  11.93
en-top   ChatGPT      32.88  26.68
zh-top   ChatGPT      53.19  39.36

Table 1: The results of ChatGPT and other traditional supervised parsing methods on PTB and CTB5. The en-top and zh-top rows use the top 100 sentences extracted from the en-ewt-test and zh-cfl-test.

          Gold  ChatGPT
Avg_DTED  0.64  0.45

Table 2: The average DTED score between the top-50 similarity sentences of English and Chinese.

Model       bg     ca     cs     de     en     es     fr     it     nl     no     ro     ru     AVG
Biaffine    90.30  94.49  92.65  85.98  91.13  93.78  91.77  94.72  91.04  94.21  87.24  94.53  91.82
StackPTR    89.96  92.39  90.94  86.16  89.83  91.52  89.88  92.55  91.73  93.62  85.34  93.07  90.58
DPSG        93.92  93.75  92.97  84.84  91.49  92.37  90.73  94.59  92.03  95.30  88.76  95.25  92.17
Hexatagger  92.87  93.79  92.82  85.18  90.85  93.17  91.50  94.72  91.89  93.95  87.54  94.03  91.86
ChatGPT     35.87  34.04  33.37  26.88  32.63  35.01  29.56  24.60  29.79  30.86  33.78  34.97  31.78

Table 3: Results on 12 languages of UD2.2 in terms of LAS.
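For reference, the UAS and LAS figures in Tables 1 and 3 are the standard attachment scores. Below is a minimal sketch of their computation over post-processed outputs; punctuation conventions and other evaluation details are omitted, as the paper does not specify them.

```python
def attachment_scores(pred_rows, gold_rows):
    """Compute (UAS, LAS) in percent over aligned (id, word, head, relation) rows.

    UAS counts tokens whose predicted head matches gold; LAS additionally
    requires the relation label to match. Token alignment is assumed, since
    the Appendix A post-processing discards outputs that cannot be aligned.
    """
    assert len(pred_rows) == len(gold_rows), "outputs must be aligned"
    total = len(gold_rows)
    uas = sum(p[2] == g[2] for p, g in zip(pred_rows, gold_rows)) / total
    las = sum(p[2] == g[2] and p[3] == g[3]
              for p, g in zip(pred_rows, gold_rows)) / total
    return 100 * uas, 100 * las
```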

and "for", where they are labeled inconsistently as the appropriate word to which the dependent should
either prepositions or case. be linked.
Auxiliary Verb Mislabeling: This phenomenon Nummod-Dep: The utilization of numerals
arises when a sentence comprises both auxiliary within ChatGPT outputs is typically labeled as
verb and non-finite verb. In such cases, ChatGPT "nummod", whereas the gold standard annotations
tends to designate the auxiliary verb as the root due often designate it as "dep". It is worth noting that
to the influence of positional placement. in certain cases, the gold annotation may be in-
Adjective Modifier Ambiguity: This phe- accurate, incorrectly assuming that the numeral
nomenon emerges when a noun is preceded by modifies a noun that it logically does not modify.
multiple modifiers, creating ambiguity in discern-
ing the appropriate modifier. 6.3 Summarization
In general, the inconsistencies between ChatGPT
6.2 Chinese Analysis and golds can be categorized into three groups: 1)
We also conduct linguistic analysis for CTB5, and Instances where ChatGPT outputs are incorrect;
the observations are summarized into the following, 2) Cases where ChatGPT and golds can both be
more statistics details are shown in Appendix D. considered correct, but differ due to distinct per-
Obj-Dobj: The most prevalent disparity ob- spectives considered in the annotations; 3) Situ-
served between the ChatGPT and the golds is the ations where ChatGPT exhibits greater linguistic
"obj-dobj". Although both "obj" and "dobj" signify normativity compared to the golds.
a direct object relationship, their usage is inconsis- The third category is intriguing as it highlights
tent across the two tagging sets, indicating potential the potential of ChatGPT. While traditional pars-
variations in labeling conventions. ing methods based on supervised learning achieve
Compound-Nmod: The differentiation between impressive performance, the parsing capabilities of
the "compound" (indicating a compound noun, ad- these models are constrained by the labeled data.
jective, or adverb) and "nmod" (representing a nom- However, the parsing abilities of ChatGPT acquired
inal modifier of a noun or pronoun) frequently leads through pre-training may transcend this limitation
to discrepancies between ChatGPT and the golds. and offer researchers a novel perspective. This
breakthrough could potentially provide researchers
Quote Root: When a sentence portrays a state-
with valuable insights or alternative viewpoints.
ment attributed to an individual, with the speaker
positioned at the end of the sentence, ChatGPT
7 Conclusion
demonstrates a tendency to designate the verb
within the quotation as the root, whereas the golds We employ the prompt to investigate the zero-
mark the verb "say" as the root. shot dependency parsing capability of ChatGPT
Error Root: ChatGPT occasionally misclassi- and other Language Models (LLMs) on prover-
fies punctuation marks as the root, thereby assign- bial benchmarks. The experimental results substan-
ing them undue prominence. Additionally, it dis- tiate that ChatGPT exhibits promising potential
plays inconsistencies in labeling the main verb as as a zero-shot dependency parser. Furthermore,
the root, often assigning this role to nouns instead. cross-lingual experiments demonstrate the ability
In certain cases, ChatGPT fails to identify the root. of ChatGPT to maintain similarity in parsing out-
Dobj-Nmod: In certain instances, ChatGPT er- puts across different languages. Additionally, lin-
roneously labels "dobj" as "nmod". This primarily guistic analysis is performed to discern the pars-
transpires when there are disagreements regarding ing output preferences of ChatGPT. The analysis
reveals that ChatGPT has the ability to surpass limitations stemming from errors in the labeled data.

Limitations

Considering the powerful learning ability of Large Language Models (LLMs), we use a prompt-based method to analyze the zero-shot dependency parsing ability of LLMs. The different formats of prompts might significantly affect the final outputs and could be a disturbance for our experiments. Moreover, the datasets and corpora used by LLMs are unclear, and that might influence our linguistic analyses. In addition, linguistic analysis may be mixed with some subjective judgments, and due to the complex format of the parsing data, it is difficult to directly compute statistics for many of the linguistic phenomena analyzed.

Ethics Statement

We affirm that our work here does not exacerbate the biases already inherent in large language models, and our linguistic analyses are also only based on those model outputs. As a result, we anticipate no ethical concerns associated with this research.

References

Afra Amini, Tianyu Liu, and Ryan Cotterell. 2023. Hexatagging: Projective dependency parsing as tagging. Association for Computational Linguistics.

Emanuele Bugliarello and Naoaki Okazaki. 2020. Enhancing machine translation with dependency-aware self-attention. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1618–1627, Online. Association for Computational Linguistics.

Avani Chandurkar and Ajay Bansal. 2017. Information retrieval from a structured knowledgebase. In 11th IEEE International Conference on Semantic Computing, ICSC 2017, San Diego, CA, USA, January 30 - February 1, 2017, pages 407–412. IEEE Computer Society.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy. European Language Resources Association (ELRA).

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net.

Timo Jarvinen and Pasi Tapanainen. 1998. Towards an implementable dependency grammar. arXiv preprint cmp-lg/9809001.

Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. 2019. What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3651–3657, Florence, Italy. Association for Computational Linguistics.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880, Online. Association for Computational Linguistics.

Zhenghua Li, Xue Peng, Min Zhang, Rui Wang, and Luo Si. 2019. Semi-supervised domain adaptation for dependency parsing. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2386–2395, Florence, Italy. Association for Computational Linguistics.

Boda Lin, Zijun Yao, Jiaxin Shi, Shulin Cao, Binghao Tang, Si Li, Yong Luo, Juanzi Li, and Lei Hou. 2022. Dependency parsing via sequence generation. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 7339–7353, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Xuezhe Ma, Zecong Hu, Jingzhou Liu, Nanyun Peng, Graham Neubig, and Eduard Hovy. 2018. Stack-pointer networks for dependency parsing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1403–1414, Melbourne, Australia. Association for Computational Linguistics.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Martin McCaffery and Mark-Jan Nederhof. 2016. DTED: Evaluation of machine translation structure using dependency parsing and tree edit distance. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 491–498, Berlin, Germany. Association for Computational Linguistics.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1659–1666, Portorož, Slovenia. European Language Resources Association (ELRA).

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.

Anna Rogers, Olga Kovaleva, and Anna Rumshisky. 2020. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842–866.

Damien Teney, Lingqiao Liu, and Anton van den Hengel. 2017. Graph-structured representations for visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3233–3241. IEEE Computer Society.

Zhiyong Wu, Yun Chen, Ben Kao, and Qun Liu. 2020. Perturbed masking: Parameter-free probing for analyzing and interpreting BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4166–4176, Online. Association for Computational Linguistics.

Nianwen Xue, Fei Xia, Fu-dong Chiou, and Marta Palmer. 2005. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Natural Language Engineering, 11(2):207–238.

Songlin Yang and Kewei Tu. 2022. Headed-span-based projective dependency parsing. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2188–2200, Dublin, Ireland. Association for Computational Linguistics.

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.

Kaizhong Zhang and Dennis Shasha. 1989. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245–1262.

A Post-processing Details

Since the dependency parsing task is fine-grained, with high requirements on vocabulary and output format, and because of the uncontrollability of the LLM itself, there are many formal errors in the outputs of ChatGPT, as follows:

Word filtering: Since some of the parsing corpora come from political news, certain vocabulary may trigger the filtering policy of ChatGPT, leading to the omission of sensitive words in the output CoNLL results.

Format disruption: Occasionally, ChatGPT may not output in the standard CoNLL format, causing issues such as missing columns, extra columns, or disordered columns.

Word segmentation disruption: This phenomenon is particularly common in languages that require word segmentation, such as Chinese. Even though we clearly pre-segmented the input with spaces, ChatGPT may sometimes employ its own segmentation.

Word omission: In lengthy sentences, there might be instances where a sequence of words is missing.

Word scrambling: In extended sentences, the output CoNLL results may contain parts where the vocabulary is scrambled.

Multiple outputs: In some cases, ChatGPT will give duplicate parsing outputs for a sentence.

Since these formal errors cause the prediction and gold to fail to align, we use post-processing to filter out outputs containing these errors. The sizes of the original test sets and of the data retained after post-processing are shown in Table 4.

B Examples of Similarity Trees

In addition to the calculated DTED scores shown in Table 2 in Section 5, we can also see intuitively from Table 5 that ChatGPT parses similar sentences from different languages with similar structures.
              PTB    CTB5   bg     ca     cs      de   en     es     fr   it   nl     no     ro   ru
Total Number  2,416  1,915  1,116  1,846  12,203  977  2,077  2,174  416  482  1,396  3,450  729  6,491
Final Number  1,394  990    518    1,307  8,040   505  1,164  1,644  283  374  614    2,492  498  5,742

Table 4: The number of sentences in the original test sets of PTB, CTB5 and the 12 languages of UD2.2, and the number of sentences retained after post-processing.
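The post-processing filter behind Table 4 can be sketched as follows; the concrete checks are our reading of the error categories listed in Appendix A (omission, scrambling, segmentation, duplication), since the authors' exact rules are not published.

```python
def is_legal_output(rows, gold_words):
    """Return True if a parsed reply survives the Appendix A-style filters.

    rows: (id, word, head, relation) tuples from the parsed CoNLL reply.
    gold_words: the pre-segmented input tokens of the sentence.
    """
    if len(rows) != len(gold_words):          # word omission / re-segmentation
        return False
    ids = [i for i, _w, _h, _r in rows]
    if ids != list(range(1, len(rows) + 1)):  # duplicated or multiple outputs
        return False
    words = [w for _i, w, _h, _r in rows]
    if words != list(gold_words):             # word scrambling / word filtering
        return False
    # format disruption: every head must point at a valid token or the root (0)
    return all(0 <= h <= len(rows) for _i, _w, h, _r in rows)
```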

ID  Word  Pred-Head  Pred-Rel  Gold-Head  Gold-Rel  |  Word  Pred-Head  Pred-Rel  Gold-Head  Gold-Rel  (English left, Chinese right)
1 He 2 nsubj 3 nsubj 我们 2 nsubj 3 nsubj
2 has 0 root 3 aux 要 0 root 3 aux
3 denied 2 xcomp 0 root 去 2 xcomp 0 root
4 this 3 obj 3 obj 目的地 3 obj 3 obj
5 . 2 punct 3 punct ! 2 punct 3 punct
1 you 2 nsubj 3 nsubj 你 2 nsubj 3 nsubj
2 r 0 root 3 cop 是 0 root 3 cop
3 retarded 2 xcomp 0 root 学生 2 attr 0 root
4 . 2 punct 3 punct ? 2 punct 3 punct
1 Just 2 advmod 3 advmod 有点儿 2 amod 2 advmod
2 our 3 amod 3 nmod:poss 恼火 0 root 0 root
3 standard 0 root 0 root 了 2 mark 2 discourse:sp
4 . 3 punct 3 punct 。 2 punct 2 punct

Table 5: Three similarity parsing tree pair examples.

In Table 5, the left and right columns are the corresponding English-Chinese syntactic tree pairs with similar structure; the DTED score of the first two pairs is 1, and that of the third pair is 0.8. Obviously, although the sentences in each syntactic tree pair come from different languages, the outputs of ChatGPT still have a high degree of similarity, even when wrong parsing outputs are given.

C Examples of English Linguistic Analysis

Predicative Verb-Centrism: In the sentence "What if Google Morphed Into GoogleOS?", the correct root token should be "What". However, the ChatGPT result indicates that the subordinate clause is not recognized, and only the predicate verb is identified. Conversely, in cases where a clause contains a main clause with a predicate verb, such as "One of the pictures shows a flag that was found in Fallujah", ChatGPT does not mark the root incorrectly.

Noun Subject Preference: For example, consider "One of the pictures shows a flag that was found in Fallujah." In this sentence, ChatGPT takes the actual subject within the phrase "one of the pictures" to be "pictures", the noun being quantitatively modified, with the words "one", "of", and "the" dependent on "pictures". However, in reality, the true subject should be "one", "of", and "the" collectively. The reason is that if the subject were solely "pictures", the verb "shows" would not be in the third person singular form. Consequently, the analysis of the sentence exhibits a flaw in terms of marking the subject accurately.

Preposition-Case Ambiguity: In the sentence "Compare the flags to the Fallujah one.", the standard analysis designates "to" as a case-grammatical marker, which is dependent on "one". The word "one" serves as the object being compared, and the presence of "to" indicates that "one" is the direct object. However, ChatGPT only identifies "to" as a preposition, overlooking its role as a case-grammatical marker. Additionally, there is ambiguity in the phrase "the Fallujah one" that follows. As ChatGPT only labels "to" as a preposition, it may interpret it as "to the Fallujah", suggesting that "the" is the determiner for the compound noun "Fallujah one". However, "Fallujah" does not naturally form a compound noun with "one", and it is not possible to establish a dependency relationship among "the", "Fallujah", and "one".

Auxiliary Verb Mislabeling: In the sentence "I'm staying away from the stock.", the correct root should be "staying", while "am" functions as the auxiliary verb assisting in tense formation. However, there is a mislabeling where "am" is incorrectly marked as the root. Similarly, in the sentence "He has denied this.", the root should be "denied", but ChatGPT mistakenly identifies "has" as the root. Moreover, in the sentence "It does seem that Iranians frequently make statements and then hide behind the lack of proof.", the root should be "seem", but ChatGPT erroneously identifies "does" as the root. These instances highlight inconsistencies in root identification by ChatGPT, where the actual root is mislabeled in favor of auxiliary verbs or other words in the sentence.

Adjective Modifier Ambiguity: In the sentence "The clerics demanded talks with local US commanders.", ChatGPT tends to analyze the sentence in a way that suggests a dependency between the first modifier and the second modifier, and another dependency between the second modifier and the noun "commanders". Although this dependency may raise semantic concerns, it is syntactically acceptable. However, according to the gold standard annotations, it is "local" that is dependent on "commanders", and "US" has a separate dependency with "commanders". In other words, there is no direct relationship between "local" and "US" in the gold standard annotations.

ID Word Pred-Head Pred-Rel Gold-Head Gold-Rel

Predicative Verb-Centrism
1 What 4 nsubj 0 root
2 if 4 mark 4 mark
3 Google 4 nsubj 4 nsubj
4 Morphed 0 root 1 advcl
5 Into 4 prep 6 case
6 GoogleOS 5 pobj 4 obl
7 ? 4 punct 4 punct

Noun Subject Preference
1 One 2 nummod 5 nsubj
2 of 4 case 4 case
3 the 4 det 4 det
4 pictures 5 nsubj 1 nmod
5 shows 0 root 0 root
6 a 7 det 7 det
7 flag 5 obj 5 obj
8 that 9 nsubj:pass 10 nsubjpass
9 was 5 acl:pass 10 aux:pass
10 found 9 auxpass 7 acl:relcl
11 in 12 case 12 case
12 Fallujah 10 obl 10 obl
13 . 5 punct 5 punct

Preposition-Case Ambiguity
1 Compare 0 root 0 root
2 the 3 det 3 det
3 flags 1 obj 1 obj
4 to 1 prep 7 case
5 the 6 det 7 det
6 Fallujah 4 pobj 7 compound
7 one 1 dobj 7 obl
8 . 1 punct 1 punct

Auxiliary Verb Mislabeling
1 It 2 nsubj 3 expl
2 does 0 root 3 aux
3 seem 2 ccomp 0 root
4 that 3 mark 7 mark
5 Iranians 6 nsubj 7 nsubj
6 frequently 3 advmod 7 advmod
7 make 6 conj 3 ccomp
8 statements 7 dobj 7 obj
9 and 7 cc 11 cc
10 then 11 advmod 11 advmod
11 hide 7 conj 7 conj
12 behind 11 prep 13 case
13 lack 14 compound 11 obl
14 of 12 pobj 15 case
15 proof 14 nmod 13 nmod
16 . 2 punct 3 punct

Adjective Modifier Ambiguity
1 The 2 det 2 det
2 clerics 4 nsubj 3 nsubj
3 demanded 4 aux 0 root
4 talks 0 root 3 obj
5 with 6 case 8 case
6 local 7 amod 8 amod
7 US 8 compound 8 compound
8 commanders 4 obl 4 nmod
9 . 4 punct 3 punct

Table 6: The analysis examples in English.

D Statistics of Chinese Linguistic Analysis

We count the number of sentences exhibiting each of the types of linguistic analysis given in Section 6.2, as shown in Table 7.

E Examples of Chinese Linguistic Analysis

Obj-Dobj: The most frequent label that differs between ChatGPT results and gold is "obj-dobj" (labeled obj in the outputs of ChatGPT and dobj in gold). In the CoNLL format data description provided, obj is the direct object relationship, while dobj (direct object) is also the direct object. Both obj and dobj appear in ChatGPT outputs, whereas there is no obj tag in gold. This suggests that the labels used in the two sets of annotation results are different, and that the labels in ChatGPT outputs are confusing. For example, in the corresponding sentence in Table 8, ChatGPT outputs and gold label the dependencies identically, but for the relationship between "欧文" and "看好", ChatGPT labels it as obj while gold labels it as dobj.

Compound-Nmod: ChatGPT readily judges the compound (compound noun) relationship differently from gold, which more often labels it as nmod; in the example in Table 8, "联赛" is dependent on "俱乐部", and ChatGPT labels the relationship as nmod, whereas gold labels it as compound. The crucial aspect to consider is the presence of a conceptual overlap between "compound", which denotes a compound construction of a noun, adjective, or adverb, and "nmod", which signifies a nominal modifier, i.e., a noun, adjective, or adverb modifying another noun or pronoun. The distinction between these two categories is not always clearly defined, and there are instances where determining whether a relation should be labeled as "compound" or "nmod" can be subjective. Therefore, the labeling norms and conventions for these categories still remain a matter of debate and interpretation.

Quote Root: When a sentence is spoken by someone and the speaker is positioned at the end of the sentence, so that the quotation forms a complete sentence with a predicate verb, the ChatGPT labeling scheme assigns the predicate verb as the root. On the other hand, the gold standard annotation assigns the speech verb, represented by the Chinese character "说", as the root in such cases. Similarly, in some instances the ChatGPT output labels the verb "怀疑" as the root, while gold labels "他" as the root. These discrepancies in root labeling between ChatGPT outputs and gold exemplify the differing perspectives and criteria utilized in these annotation schemes.

Error Root: The annotation of the root in Chinese corpora can sometimes exhibit unexpected errors. One particular error involves ChatGPT mislabeling punctuation as the root. Furthermore, ChatGPT does not consistently label predicate verbs as the root in all cases. These inconsistencies highlight the challenges and potential shortcomings of the ChatGPT annotation scheme in determining the root of Chinese sentences.

Dobj-Nmod: There are instances in which ChatGPT labels the direct object (dobj) as a nominal modifier (nmod). One such case is exemplified by the sentence "发言人主张该国就入侵邻国正式道歉", where ChatGPT considers "邻国" to be dependent on the noun "入侵" and assigns the relation nmod. In contrast, the gold standard annotation marks the relation as dobj. This discrepancy in labeling suggests a disagreement in assigning the correct dependency relation between the two annotation schemes.

Nummod-Dep: When it comes to number words in sentences, ChatGPT more frequently labels them as "nummod", whereas gold annotations often label them as "dep". This distinction arises primarily due to differing judgments regarding dependency relationships between the two annotation schemes. The variation in labeling can be attributed to differences in the interpretation of the role and dependency of number words within the sentence structure. It highlights the impact of subjective judgment in determining the appropriate dependency label for number words.

Gold Errors: Regarding the gold annotations, there is also a significant issue with errors in their labeling. In such cases, it becomes challenging to compare and analyze the preferences between gold and ChatGPT outputs. For example, in the given example, the words "内幕" and "丑闻" clearly have a coordinate relationship, and either of them can be considered the root. The word "或" should be identified as a coordinating conjunction, indicating the coordination between "内幕" and "丑闻". If we assume "或" as the root, with both "内幕" and "丑闻" having an unknown relationship with "或", it would still make sense. However, the gold annotation labels "或" as a coordinating conjunction while simultaneously marking the relationship between "内幕" and "丑闻" as unknown, which seems somewhat unreasonable. In summary, the issue lies in the inconsistencies and potential errors within the gold annotations, making it challenging to establish a reliable basis for comparison and analysis against ChatGPT.

Category       snt_number  snt_percentage
obj-dobj       355         35.86%
nmod-compound  145         14.65%
punct root     106         10.71%
nmod-dobj      56          5.66%
nummod-dep     26          2.63%

Table 7: The statistics of linguistic categories of CTB5. Percentages are relative to the 990 CTB5 test sentences retained after post-processing (see Table 4).


ID Word Pred-Head Pred-Rel Gold-Head Gold-Rel
Obj-Dobj
1 维阿里 2 nsubj 2 nsubj
2 看好 0 root 0 root
3 欧文 2 obj 2 dobj
Quote Root
1 “ 0 root 2 punct
2 光头 1 nsubj 4 appos
3 ” 1 punct 2 punct
4 维阿里 2 flat 6 nsubj
5 现在 6 advmod 6 nmod:tmod
6 担任 1 ccomp 0 root
7 英格兰 8 flat 10 nmod:assmod
8 超级 9 compound 9 amod
9 联赛 10 nmod 10 compound:nn
10 俱乐部 6 obj 11 appos
11 切尔西队 10 nmod 13 nmod:assmod
12 的 11 case 11 case
13 教练 6 obj 6 dobj
14 。 1 punct 6 punct
Compound-Nmod
1 我 4 nsubj 5 nsubj
2 一点 3 det 5 advmod
3 也 4 advmod 5 advmod
4 不 5 advmod 5 neg
5 怀疑 0 root 19 dep
6 欧文 5 obj 15 nsubj
7 将 8 aux 15 advmod
8 是 5 ccomp 15 cop
9 未来 10 compound:nn 10 dep
10 几 11 nummod 15 dep
11 年 8 obl 10 mark:clf
12 内 11 case 10 case
13 真正 14 amod 15 amod
14 的 11 nmod 13 mark
15 巨星 8 obj 5 ccomp
16 , 5 punct 19 punct
17 ” 5 punct 19 punct
18 他 20 nsubj 19 nsubj
19 说 20 ccomp 0 root
20 。 5 punct 19 punct
Error Root
1 ( 0 root 5 punct
2 左 1 punct 5 dep
3 一 1 punct 5 dep
4 为 1 punct 5 dep
5 作者 4 punct 0 root
6 ) 1 punct 5 punct
Dobj-Nmod
1 发言人 2 nsubj 2 nsubj
2 主张 0 root 0 root
3 该国 5 nmod 8 nsubj
4 就 5 advmod 5 case
5 入侵 2 obj 8 nmod:prep
6 邻国 5 nmod 5 dobj
7 正式 8 advmod 8 advmod
8 道歉 2 ccomp 2 ccomp
Nummod-Dep
1 十 2 nummod 3 dep
2 面 4 nsubj 1 mark:clf
3 埋伏 4 compound:nn 0 root
4 , 2 punct 3 punct
5 创造 2 conj 3 conj
6 声势 5 obj 5 dobj
Gold Errors
1 内幕 3 nsubj 4 dep
2 、 1 punct 4 punct
3 或 0 root 4 cc
4 丑闻 3 conj 0 root
5 ? 3 punct 4 punct

Table 8: The analysis examples in Chinese.
