2412.00869v1

Exploring the Abilities of Large Language Models to Solve
Proportional Analogies via Knowledge-Enhanced Prompting
Thilini Wijesiriwardene1 , Ruwan Wickramarachchi1 , Sreeram Vennam2 , Vinija Jain4,5 * ,

Aman Chadha3,5† , Amitava Das1 , Ponnurangam Kumaraguru2 , Amit Sheth 1 ,
1
AI Institute, University of South Carolina, USA, 2 IIIT Hyderabad, India
3
Amazon GenAI, USA, 4 Meta, USA 5 Stanford University, USA
Correspondence: thilini@sc.edu
Abstract 2022; Ushio et al., 2021; Szymanski, 2017; Drozd

et al., 2016), sentence-analogies (Jiayang et al.,
Making analogies is fundamental to cognition.
2023; Afantenos et al., 2021; Zhu and de Melo,
Proportional analogies, which consist of four
2020; Wang and Lepage, 2020) and analogies of
arXiv:2412.00869v1 [cs.CL] 1 Dec 2024
terms, are often used to assess linguistic and

cognitive abilities. For instance, completing longer text (Sultan and Shahaf, 2022; Sultan et al.,
analogies like “Oxygen is to Gas as <blank> is 2024). Proportional analogies, which is the focus
to <blank>” requires identifying the semantic of this paper, are presented in the form A:B::C:D,
relationship (e.g., “type of”) between the first meaning A is to B as C is to D. These analogies
pair of terms (“Oxygen” and “Gas”) and finding involve four terms, where the relationship between
a second pair that shares the same relationship the first pair of terms (A and B) is similar to the
(e.g., “Aluminum” and “Metal”). In this work,
we introduce a 15K Multiple-Choice Question
relationship between the second pair of terms (C
Answering (MCQA) dataset for proportional and D).
analogy completion and evaluate the perfor- Generative Artificial Intelligence (GenAI) mod-
mance of contemporary Large Language Mod- els, particularly those recognized for their capac-
els (LLMs) in various knowledge-enhanced ity to generate high-quality textual outputs1 , have
prompt settings. Specifically, we augment emerged as a focal point of research in contempo-
prompts with three types of knowledge: ex- rary Natural Language Processing. The capabilities
emplar, structured, and targeted. Our results
of these models are typically evaluated through
show that despite extensive training data, solv-
ing proportional analogies remains challenging a range of tasks, including question answering
for current LLMs, with the best model achiev- (Arora et al., 2022; Kasai et al., 2023), reasoning
ing an accuracy of 55%. Notably, we find that (Zhang et al., 2024), paraphrasing (Witteveen and
providing targeted knowledge can better assist Andrews, 2019), sentiment analysis (Kheiri and
models in completing proportional analogies Karimi, 2023) and, more recently, analogical rea-
compared to providing exemplars or collections soning (Bhavya et al., 2024; Wijesiriwardene et al.,
of structured knowledge.
2023). Notably, Wijesiriwardene et al. (2023) have
1 Introduction demonstrated that SAT-style2 Proportional analo-
gies pose significant challenges for LLMs, partic-
The ability to form analogies enables humans to ularly when solved using intrinsic distance-based
transfer knowledge from one domain to another, similarity measures. Conversely, Webb et al. (2023)
making it a core component of human cognition have shown that GPT-3 can surpass human perfor-
(Hofstadter, 2001; Holyoak et al., 2001; Minsky, mance in solving proportional analogies, though
1988). Specifically, in analogy-making, the em- these findings were based on a dataset with limited
phasis is on the relations among objects, as it is size (774 data points) and a narrow range of dis-
the system of relations that is compared across tinct semantic relations among term pairs (seven
domains rather than the specific objects and their semantic relation types). Motivated by the need
attributes (Gentner, 1983). Researchers have iden- 1
In this work, Generative AI models refer to Large Lan-
tified several types of analogies within the domain guage Models (LLMs) capable of producing high-quality tex-
of NLP, such as proportional analogies (analogies tual content. Therefore, we use the term “GenAI Models” and
among word/term pairs) (Brown, 1989; Chen et al., LLMs interchangeably.
2
SAT is a US college admission test where proportional
* Work does not relate to position at Meta. analogies were used to assess linguistic and cognitive abilities
†
Work does not relate to position at Amazon. of examinees.
1
Types of Knowledge Prompting Techniques
No Knowledge Zero-shot Prompting
Question
Oxygen : Gas :: ? Few-shot Prompting
Exemplar Knowledge
(One-shot & Five-shot)
Answer Choices
1. Cobra : Venom 3. Aluminum : Metal
2. Doctor : Hospital Structured Knowledge Model
Structured Knowledge
3. Aluminum : Metal Prompting (SKP)
4. Airplane : Cloud
Targeted Knowledge Prompting

Targeted Knowledge
(TKP)
Figure 1: Knowledge-enhanced Prompting. An illustration of our knowledge-enhanced prompting approach with

types of knowledge and prompting techniques. The question consists of two terms (“Oxygen” and “Gas”), and
answer choices consist of term pairs that are analogous to the question term pair. Each model is queried using the
prompting techniques illustrated.
to broaden the scope of research, we scale up the prompting strategies for solving proportional analo-
evaluation by assessing a diverse set of GenAI mod- gies.
els on a larger, more comprehensive proportional Our findings indicate that completing propor-
analogy dataset. Additionally, we employ various tional analogies is highly challenging for current
prompting techniques enhanced with multiple types LLMs and incorporating targeted knowledge sig-
of knowledge to understand model capabilities in nificantly enhances model performance, with the
completing proportional analogies. best-performing model showing an improvement
of approximately +21% compared to prompts with-
Our primary contribution lies in conducting a
out any knowledge, and around +45% relative to
comprehensive evaluation of nine GenAI models,
prompts enhanced with structured knowledge. The
specifically assessing their performance in solv-
underperformance of SKP relative to Zero-shot
ing proportional analogies presented in a multiple-
Prompting suggests that the mere inclusion of rel-
choice format. Considering the limitations of exist-
evant knowledge may not always improve model
ing proportional analogy datasets, which typically
performance.
comprise fewer than a thousand data points and
a restricted range of relation types, we present a 2 Related Work
substantially larger dataset. Our dataset contains
15K proportional analogies with 238 distinct rela- In this section, we introduce related literature
tion types. We evaluate the nine GenAI models on the main topics of our paper: proportional
on the 15K dataset using four distinct prompting analogies and LLMs, prompting techniques, and
techniques: (i) Zero-shot Prompting, where no ad- knowledge-enhancement in LLM prompting.
ditional knowledge is incorporated into the prompt,
(ii) Few-shot Prompting, where exemplar knowl- 2.1 Proportional Analogies and LLMs
edge in the form of examples from the dataset is in- One of the earliest methods for solving propor-
cluded in the prompt, (iii) Structured Knowledge tional analogies was Latent Relational Analysis
Prompting (SKP), where the prompt is augmented (LRA), introduced by Turney (2005). LRA de-
with structured knowledge in the forms of lexi- termines analogy by measuring the similarity in
cal, commonsense, and world knowledge drawn semantic relationships shared between word pairs,
from WordNet (McCrae et al., 2019), Concept- considering them analogous if they exhibit a high
Net (Speer et al., 2017), and Wikidata (Vrandečić degree of relational similarity. With the advent
and Krötzsch, 2014) respectively and (iv) Targeted of neural networks, vector difference-based meth-
Knowledge Prompting (TKP), which integrates ods (Vylomova et al., 2016; Allen and Hospedales,
targeted knowledge in the form of specific seman- 2019; Mikolov et al., 2013) were used to address
tic relationships necessary for solving proportional proportional analogies. As LLMs based on the
analogies, along with the cognitive process behind Transformer architecture (Vaswani et al., 2017a)
such reasoning. To the best of our knowledge, this gained prominence, researchers began investigat-
study is the first to explore knowledge-enhanced ing the potential of LLMs, particularly Generative
2
Artificial Intelligence (GenAI) models, for solving vide an exemplar. Instead, it enhances the prompt
proportional analogies (Brown, 2020; Ushio et al., with the targeted knowledge specific to solving
2021; Webb et al., 2023). Specifically, Webb et al. proportional analogies. As a result, TKP is more
(2023) demonstrated strong performance using a akin to Zero-shot-CoT (Kojima et al., 2022) than
single model (GPT-3) on four relatively small pro- to traditional CoT (Wei et al., 2022).
portional analogy datasets. Our study extends this The enhancement of LLM performance through
work by scaling up the evaluation to a substantially the integration of external knowledge, both unstruc-
larger dataset and by assessing nine contemporary tured and structured, has been extensively studied
GenAI models across six distinct prompting ap- (Yu et al., 2022). Some approaches transform ex-
proaches. Additionally, we introduce a novel external knowledge from multiple documents into
ploration of the impact of incorporating various graph structures and utilize these graphs to enhance
types of knowledge when evaluating GenAI mod- LLM querying (Wang et al., 2024). Additionally,
els on proportional analogies. some methods directly employ structured knowl-
edge (Baek et al., 2023). Retrieval-augmented gen-
2.2 Prompting and Knowledge-enhanced
eration (RAG) has recently emerged as an umbrella
Prompting
term encompassing all these techniques, where
GenAI models are built on LLMs that are trained on user queries are enriched with content retrieved
extensive datasets and optimized for various tasks, from external sources to enhance model perfor-
including question-answering. This training im- mance (Lewis et al., 2020; Ding et al., 2024; Mi-
plies that these models encapsulate the knowledge alon et al., 2023; Schulhoff et al., 2024). In this
in the data, allowing them to effectively answer work, we utilize multiple types of knowledge, in-
natural language queries (Roberts et al., 2020; Zhu cluding targeted and structured knowledge (from
and Li, 2023). Prompting involves transforming three sources), to assess the impact on LLM per-
an input query into a structured natural language formance in solving proportional analogies. To
statement (prompt) and presenting it to the model, the best of our knowledge, this is the first study
which then guides the output generation process to explore the capabilities of LLMs in solving pro-
of the model. (Schulhoff et al., 2024; Hadi et al., portional analogies using knowledge-enhancement
2023; Liu et al., 2023). Generating outputs through approaches.
prompting requires only forward passes during in-
ference time, without any weight updates. Prompts 3 Approach
can be created either manually (Wei et al., 2022;
Schulhoff et al., 2024) or automatically (Ye et al., As illustrated in Figure 1, given a proportional
2023; Reynolds and McDonell, 2021; Deng et al., analogy MCQ where the question consists of a
2022); in this work, we employ the more intuitive single term pair (e.g., “Oxygen” and “Gas”),
manual approach. the GenAI model is required to provide the cor-
Prompts can be categorized based on the con- rect answer choice from five, four or three choices.
text they provide. Zero-shot prompts (Brown, Zero-shot Prompting, only include the MCQ and
2020) contain only instructions related to solv- a simple instruction on how to produce the out-
ing a specific task, whereas Few-shot prompts put without any knowledge enhancement added
(Brown, 2020) include both the instructions and to the prompt. Next, we enhance the Zero-shot
one or more examples. Providing examples when Prompt with exemplars of solved MCQs from the
querying models is a paradigm broadly known as dataset. We consider this approach as enhanc-
In-context Learning (ICL) (Brown, 2020). Chain- ing the prompt with “exemplar knowledge” and
of-Thought (CoT) Prompting is designed to guide refer to this prompting technique as Few-shot
models through the reasoning process required to Prompting. We experiment with one exemplar
solve a task by presenting an exemplar that includes (One-shot Prompting) and five exemplars (Five-
the question, reasoning path, and correct answer shot Prompting). Then a combination of lexi-
(Wei et al., 2022) or by just incorporating a thought- cal, commonsense, and world knowledge from
inducing phrase such as “Let’s think step by step” structured sources—WordNet, ConceptNet, and
(Kojima et al., 2022) (Zero-shot-CoT). Unlike con- Wikidata, respectively—is added to the Zero-shot
ventional CoT prompting, which often includes an Prompts for knowledge enhancement, resulting in
exemplar, our adaptation termed TKP does not pro- what we call SKP. Finally, the zero-shot prompt
3
is enhanced with targeted knowledge and we iden- output. Current state-of-the-art GenAI models
tify this prompting technique as TKP. Targeted are largely based on the Transformer architecture
knowledge is composed of, the semantic relation- (Vaswani et al., 2017b). In this work we compare
ship shared between the question term pair and the the following popular open-source and proprietary
cognitive process behind solving the proportional GenAI models for their ability to solve propor-
analogy. We detail the prompting techniques in tional word analogy MCQs by incorporating vari-
Section 3.3. ety of knowledge: (i) Falcon, a causal decoder-only
model (Almazrouei et al., 2023), (ii) FlanT5 (Long-
3.1 Dataset Creation
pre et al., 2023), a T5 (Raffel et al., 2020a) based
We introduce a 15K dataset of proportional analo- model trained on the Flan collection of datasets,
gies containing 5-way, 4-way and 3-way MCQs. (iii) GPT2 (Radford et al., 2019a), the first series of
Table 1 presents the dataset statistics along with models to popularize in-context instructions, (vi)
examples from the dataset. We generate 14K ques- Mistral (Jiang et al., 2023), leveraging transformers
tions out of the 15K based on the work by (Yuan architecture (Vaswani et al., 2017b) with several
et al., 2023). Yuan et al. (2023) introduced a an au- new introductions such as sliding window attention
tomatically generated million-scale analogy knowl- and pre-fill chunking, (v) Orca (Mukherjee et al.,
edge base with diverse relational structures among 2023), based on LLaMA model family (Touvron
the analogous term pairs. We adopt this resource to et al., 2023) and fine-tuned on complex explanation
develop n-way (n=[3, 4, 5]) MCQs as follows. traces obtained from GPT-4 (Achiam et al., 2023),
A single n-way MCQ consist of a pair of terms (vi) Zephyr (Tunstall et al., 2023), a fine-tuned
representing the question and five term pairs rep- version of Mistral trained on public datasets and
resenting the answer choices, among which only optimized with knowledge distillation techniques.
one term pair is the correct answer. The semantic (vii) CodeT5 (Wang et al., 2021c), a unified pre-
relationship between the term pair in the question trained encoder-decoder transformer model lever-
is the same as the semantic relationship shared be- aging code semantics and finally (viii) CodeParrot
tween the term pair which is the correct answer. (Jain, 2023), a model based on GPT-2 and trained
The rest of the incorrect answer choices consist to generate python code (ix) GPT-3.5-Turbo 4 . Fur-
of term pairs with different semantic relationships ther details of the models used are presented in
among them. Appendix A.
Thousand data points out of the 15K are bor-
rowed from work by Ushio et al. (2021); Tur- 3.3 Prompting Techniques
ney and Littman (2003); Boteanu and Chernova Currently, the most popular approach to Multiple
(2015)3 and contain 5-way, 4-way and 3-way Choice Question Answering (MCQA) is via cloze-
MCQs. We highlight that, compared to previous style prompting (Brown, 2020; Robinson et al.,
proportional analogy MCQ datasets used for re- 2023) where each answer choice is concatenated to
search (Webb et al., 2023; Ushio et al., 2021), the the question separately and scored independently
current dataset provides a significant increase in by the language model (LM). This style of prompt-
question quantity (∼15-times) and diversity (with ing is problematic since it prevents the LM from
regard to diversity of semantic relations among comparing and contrasting all available options si-
terms). Our dataset also includes the semantic rela- multaneously. Additionally, it is computationally
tionship shared by the question term pair compared expensive, as it requires multiple forward passes
to other datasets that do not include this informa- through the LM to identify the correct answer
tion (Ushio et al., 2021; Turney and Littman, 2003; (Robinson et al., 2023). To address these limita-
Boteanu and Chernova, 2015). tions, we adopt the prompt phrasing introduced by
Robinson et al. (2023) with task-specific modifi-
3.2 Model Details
cations. Specifically, the question and its symbol-
GenAI models are designed to generate content that enumerated candidate answers are provided to the
are often indistinguishable from human-produced model as a single prompt. Robinson et al. (2023)
3
Unlike the 14K MCQs created based on AnalogyKB, do not include specific instructions in the prompt
these 1K data points do not provide the semantic relationship for the model to output only the choice symbol. But
shared between the question term pair explicitly, therefore we
4
employ two NLP researchers to discuss and manually identify https://platform.openai.com/docs/models/
the shared semantic relationship. gpt-3-5-turbo
4
Questions Relations
Question Type (MCQ) 5-way 4-way 3-way Top 5 Relation Types # Data Points
Question: Tenable: Indefensible Question: Haiku: Poem Question:Ancient: Old
Choices: Choices: Choices: part of 1020
Example
(1) Unique : Unprecedented (1) Song : Musician (1) Crazy : Unhealthy is a 702
(2) Dire : Pressing (2) Novel : Book (2) Delicious : Tasty at location 518
(3) Bleak : Desolate (3) Artist : Painting (3) Smart : Intelligent follows 376
(4) Theoretical : Concrete (4) Page : Typeface producer 374
(5) Recondite : Scholarly
Amount 14386 610 4 Total # relation types 238
Table 1: Dataset statistics. The dataset consist of 15K MCQs that share 238 semantic relation types among them.
we observe that adding such specific instructions three types of knowledge: (i) Wikidata (Vrandečić
reduce the model hallucinations. Therefore we use and Krötzsch, 2014), which provides world knowl-
specific, non-ambiguous language to instruct the edge in the form of explicit information about spe-
model to only output the relevant choice symbol. cific instances, encompassing billions of nodes
The prompting techniques are detailed below (See and edges (Wang et al., 2021a); (ii) ConceptNet
example prompts in appendix C). (Speer et al., 2017), a general-domain common-
sense knowledge graph with 799,273 nodes and
3.3.1 Zero-shot Prompting 2,487,810 edges; and (iii) WordNet (McCrae et al.,
In Zero-shot Prompting, the question, all multiple 2019), a lexical database for the English language
choice answers and the instructions are provided in containing 161,338 words, 120,135 synsets, and
natural language (no knowledge is provided). 415,905 semantic relations.
We retrieve knowledge from above sources as
3.3.2 Few-shot Prompting
follows. Since analogies focus on relations oppose
We demonstrate the task to the model by providing to entities or entity attributes (Gentner, 1983), when
several exemplars in the form of question, answer retrieving knowledge from knowledge sources we
choices and the correct answer choice. Then the focus on path finding approaches oppose to sub-
actual question and answer choices are provided graph extraction approaches. To extract both world
requiring the model to choose the correct answer and commonsense knowledge, we utilize the path-
choice. We employ one-shot and five-shot prompt- finding approach by Lin et al. (2019) that identifies
ing under the few-shot prompting strategy where connections between each term pair (in both the
one example and five examples are provided re- question and answer choices). Specifically, we
spectively. We select these quantities of exemplars extract paths of length k6 from ConceptNet and
to strike a balance between the models’ maximum Wikidata. When retrieving lexical knowledge from
accepted context length and the computational re- WordNet, we extract the shortest path between term
sources required. To obtain the exemplars, we em- pairs.
ploy a semantic similarity based filtering mech-
anism as follows. We encode each proportional Knowledge Filtering. For each term pair in the
analogy MCQ in the dataset using a SOTA sen- question and answer choices, multiple knowledge
tence encoding transformer model5 , and identify paths may be retrieved. To ensure the prompts stay
the most semantically similar single example/ five within the maximum context length limit of the
examples based on Cosine similarity. evaluated language models, we filter the retrieved
paths and retain a single path for Wikidata and Con-
3.3.3 Structured Knowledge Prompting (SKP) ceptNet (See Figure 2). Filtering is not performed
We retrieve knowledge from structured sources, on WordNet since a single path (shortest) is always
filter it, and then integrate the resulting refined retrieved.
knowledge into the prompts. We detail this process The filtering mechanisms we employ are as
in the subsequent sections. follows: (i) Random Filtering, where one path
is randomly selected from the list of available
Knowledge Retrieval. We leverage the follow- paths; and (ii) Semantic Filtering, which selects
ing widely-used large knowledge sources to obtain the path most semantically similar to the term
5 6
https://huggingface.co/sentence-transformers/ k is set to 2 for Wikidata and 3 for ConceptNet, as longer
all-mpnet-base-v2 paths tend to introduce excessive noise and reduce efficiency.
5
Knowledge Sources
Knowledge
Retrieved Paths prompting technique (Wei et al., 2022) to provide
Retrieval the model with “targeted knowledge” in the form
or
of (i) semantic relationship shared by the question
ConceptNet WikiData
term pair (ii) cognitive process used by humans
when evaluating such analogies, via the prompt.
Term 1 Term 2 Random Semantic
"Term 1 semantically 4 Experimental Setting

related to Term 2"
Semantic Similarity We have conducted a comprehensive set of experi-

ments across nine GenAI models over six prompt
Selected Knowledge Path
variants on a 15K dataset, totalling to 54 (9X6)
Figure 2: An illustration of the knowledge filtering experiments.
approach. “Random” indicates Random Filtering and
“Semantic” indicates Semantic Filtering. 4.1 Implementation Details
We use API requests for GPT-3.5-Turbo and check-
points from Hugging face9 for open-source models.
pairs. The term pairs (in question and answer
The models are evaluated with following hyper pa-
choices) are formatted to “term pair sentences”
rameter settings, temperature = 0.1, top_p=0.1 and
in the following form < TERM _1> IS SEMANTI -
repetition_penalty=1.2 to elicit more concrete an-
CALLY RELATED TO < TERM _2> and returned
swers for the MCQs. We use Sentence Transform-
paths are also formatted to “path sentences” in the
ers10 to identify semantically similar exemplars and
form of [< NODE 1_ NAME > < RELATION 1_ NAME >
to perform semantic knowledge filtering. We uti-
< NODE 2_ NAME >, < NODE 2_ NAME > < RELA -
lize Wikidata knowledge from (Wang et al., 2021b),
TION 2_ NAME > < NODE 3_ NAME >, ...]. Both term
ConceptNet knowledge from conceptnet511 and
pair sentences and path sentences are then en-
WordNet knowledge from Open English WordNet
coded using a SOTA sentence encoding transformer
(2023)12 .
model7 and the path sentence with the highest co-
sine similarity to term pair sentence is filtered as 5 Results and Discussion
relevant knowledge and referred to as knowledge
paths8 . Proportional analogy multiple-choice questions
(MCQs) are presented to each GenAI model using
Generating Prompt. The filtered knowledge the previously described prompts. The model’s re-
paths are appended to the zero-shot prompt after the sponse is extracted from the generated output, and
question and the answer choices to create the SKP accuracy is measured using Exact Match Accuracy
and the model is instructed to use the knowledge if (EMA) (Rajpurkar et al., 2016). While more flex-
necessary. Based on the knowledge filtering mech- ible evaluation metrics such as BLEU (Papineni
anism SKP can be referred to as SKP[random] or et al., 2002) and ROUGE (Lin, 2004) are com-
SKP[semantic]. monly used to assess GenAI-generated outputs, we
employ EMA because MCQs are inherently evalu-
3.3.4 Targeted Knowledge Prompting (TKP)
ated in a binary manner, where partial correctness
When solving proportional analogies, humans typi- is not rewarded. We report EMA as a percentage
cally examine the question term pair, identify the for each model and prompt variant. The results are
semantic relationship between the two terms, and presented in Table 2.
select the answer pair that shares the same or a sim-
ilar relationship. Inspired by this cognitive process, 5.1 Model Performance and Prompting
we modify the traditional Chain-of-Thought (CoT) Techniques
7
https://huggingface.co/sentence-transformers/
The highest overall performance was attained by
all-mpnet-base-v2 GPT-3.5-Turbo, achieving an EMA of 55.25%.
8
specific format of Wikidata knowledge paths is 9
[<node1_name> <relation1_name> <node2_name>, https://huggingface.co/models
10
<node2_name> <relation2_name> <node3_name>] https://sbert.net/
11
and ConceptNet knowledge path is [<node1_name> https://github.com/commonsense/conceptnet5/
<relation1_name> <node2_name>, <node2_name> wiki/Downloads
12
<relation2_name> <node3_name>, <node3_name> https://github.com/globalwordnet/
<relation3_name> <node4_name>] english-wordnet?tab=readme-ov-file
6
Few-shot Prompting Structured Knowledge Prompting
Model Name Zero-shot Prompting Targeted Knowledge Prompting
One-shot Five-shot Random Semantic
Falcon 24.17 23.21 22.61 24.75 25.01 25.4
FlanT5 36.47 40.09 38.07 14.43 14.62 44.26
GPT2 22.65 22.49 7.19 6.29 6.17 21.64
Mistral 26.59 26.22 27.34 24.58 24.42 27.37
Orca 24.54 23.28 14.11 18.48 18.81 24.2
Zephyr 29.46 34.05 35.87 16.13 17.22 15.83
CodeT5 20.64 24.33 0 16.15 17.47 21.64
CodeParrot 0 10.11 12.6 0 0.01 2.09
GPT-3.5-Turbo 45.7 31.79 41.21 38.29 38.79 55.25
Table 2: MCQ Performance of models. Performance is reported in EMA percentage. Best performance of each
model is indicated in bold and the second best performance is indicated by underline.
This result underscores the challenge that pro- additional experiments to evaluate the individual
portional analogies pose for current state-of-the- contribution of each knowledge type to EMA (we
art GenAI models. This accuracy was obtained employed SKP[semantic ] prompting). Our results
through Targeted Knowledge Prompting where the show (See table 3 in Appendix B) that incorporat-
prompt was enhanced with targeted knowledge. ing each of the three knowledge types separately
Interestingly, the same model, when enhanced into prompts leads to very similar EMA values
with structured knowledge, underperformed with (when averaged across all nine models). Specifi-
an accuracy of 38% (EMA for SKP[random] is cally, prompts enhanced only with Wikidata knowl-
38.29% and SKP[semantic] is 38.79%), compared edge resulted in an average EMA of 14.57%, while
to Zero-shot prompting (EMA 45.7%). This sug- using only WordNet or only ConceptNet yielded av-
gests that simply adding knowledge, even from erage EMAs of 14.41% and 14.34%, respectively.
diverse sources, may not be beneficial for cogni- We also observed that incorporating all three
tively demanding tasks such as proportional anal- types of knowledge simultaneously into the
ogy completion. Out of the nine models four (Fal- prompts, compared to using them individually,
con, Flan, Mistral and GPT-3.5-Turbo) performs produced varying results. For example, Falcon,
the best when prompted with Targeted Knowledge CodeT5 and GPT-3.5-Turbo perform marginally
Prompts and two (GPT2 and Orca) performs the better when a single knowledge type is incorpo-
best with Zero-shot prompts with no knowledge rated into the prompt, compared to including all
enhancement. CodeT5 performs the best with one- three knowledge types simultaneously (see Figure
shot prompts and Zephyr and CodeParrot performs 3). Providing FlanT5 with a single knowledge
the best with five-shot prompts. We also observe type compared to all three knowledge types con-
that models trained specifically on code generation tributes to significant increases of percentage points
such as CodeT5 and CodeParrot (specially Code- in EMA (WordNet +18.51 , ConceptNet +15.96
Parrot) perform at the lower end of the spectrum and WikiData +7.44). In contrast, GPT-2, Mis-
despite the demonstrated abilities of them to per- tral, and Orca perform better when all knowledge
form well on other MCQ datasets Robinson et al. types are integrated into the prompt. Notably, Orca
(2023). We believe this is due to the challenging demonstrates an average EMA increase of +11.14
nature of the proportional analogy completion task. percentage points compared to using only a single
knowledge source.
5.2 Role of Structured Knowledge in Model
Performance 5.3 Exemplar Quantity vs. Model
Although enhancing prompts with structured Performance
knowledge does not consistently improve model Brown (2020) demonstrated that the accuracy of
performance compared to other prompting tech- large language models improves with an increase
niques, SKP[semantic] leads to slight increases in the number of exemplars. However, Liu et al.
in EMA values (ranging from 0.01% to 1.32%) (2022) found that the benefits diminish beyond 20
compared to SKP[random], across all models ex- exemplars in certain cases. Similarly, in our study,
cept GPT-2 and Mistral (see Table 4). We identi- increasing exemplars from one to five decreases
fied a subset of MCQs (19.96%) where all three EMA in six out of nine models (see Table 2), lead-
types of knowledge were available and conducted ing us to limit exemplars to a maximum of five.
7
26
64
37
2
83
64
9
25
35
WikiData ConceptNet WordNet All
30
ion
24.14 25
EMA(%)
12.19 20
5.84
24.11
15
24 10
14.76
5
7
0 0
Falcon FlanT5 GPT2 Mistral Orca Zephyr CodeT5 CodeParrot GPT-3.5-Turbo
Model Name
18.51 12.19 Figure 3: Perfromance with structured knowledge. Performance of each model when Structured Knowledge
9.28 22.24
12.96 11.14
Prompting with semantic filtering (SKP[semantic]) is used. All indicates the prompt is enhanced with all three types
of knowledge (Wikidata, ConceptNet and WordNet). EMA values are reported on 20% of the 15K dataset where
all three knowledge types available.
Prompting styles
Structured General Knowledge Targeted
1000 Knowledge
844.46
Average Prompt Length (Tokens)
5.4 Cost of Knowledge Acquisition vs. Model No

Knowledge
Exemplar Structured Targeted
750
ot zeroshot-random zeroshot-filtered cota-zeroshot cota-fewshot 0.60
1% 24.75%
Performance
25.01% 25.40% 25.67% GPT3.5-Turbo
500
7% 14.43% 14.62% 44.26% 37.31% 0.50 GPT3.5-Turbo
282.918
9% 6.29% In this study, we21.64%
6.17% utilize
250
three types of knowledge
272.26
19.32% 175.68 GPT3.5-Turbo
109.86 0.40
4% 24.58% 24.42%
to enhance prompts: 27.37% exemplar 26.35% knowledge, struc- FlanT5
EMA(%)
0 GPT3.5-Turbo GPT3.5-Turbo
1% 18.48% 18.81% 24.20% 25.94%
0.30
7% 16.13% tured knowledge,15.83%
17.22% and Zero-shot
targeted One-shot
34.24%
Five-shot Structured Targeted
knowledge. Among
Knowledge Knowledge
0% 16.15% 17.47% 21.64% 25.67% (semantic)
these, exemplar knowledge has the least
Prompting acquisition
Techniques 0.20 CodeParrot
0% 0.00% 0.01% 2.09% 0.04% CodeParrot CodeParrot/ CodeParrot
1% 38.29% cost38.79%
since it is readily 55.25% available
53.03% from the dataset 0.10
Code-T5
CodeParrot CodeParrot
itself requiring no additional resources. Structured 0.00
t t t ] ] P
knowledge, on the other hand, is more expensive to -sh
o
-sh
o
- sh
o m tic TK
do an
ero One F i ve Ran em
Z P[ S
acquire
Performingbecause it necessitates accessing
Promptingexternal SK P[
Best Models
Model for Each Knowledge-enchanced Style SK Best Performing Model
Prompting Styles Least Performing Model
knowledge bases or graphs and filtering knowledge,
Best Performing
Knowledge-Enhanced Prompting Style Model
Least Performing Model
which incurs computational
Zero-shot 45.70% overhead.
12.60% GPT3.5-Turbo Targeted
CodeParrot
Figure 4: Best and least performing models for each
One-shot 40.09% 10.11% FlanT5 CodeParrot
knowledge is the41.21%
Five-shot costliest to 0.00%acquire, as itCodeParrot/Code-T5
GPT3.5-Turbo involves prompting technique.
SKP[Random]
identifying the 38.29% specific semantic 0.00% GPT3.5-Turbo CodeParrotbe-
relationship
SKP[Semantic] 38.79% 0.01% GPT3.5-Turbo CodeParrot
TKPtween the question 55.25% term pairs. 2.09%This semantic
GPT3.5-Turbo rela-
CodeParrot
tionship is not always readily available, requiring structured knowledge.
human annotation (for instance, in our dataset of Even though the several models used in the cur-
15K data points, 1K data points lacked this seman- rent study are instruction finetuned versions of their
tic information, necessitating human annotation). base models, they are not necessarily finetuned for
As shown in Table 2, targeted knowledge, be- the proportional analogyExemplar
0.60
completion task. There-
Structured Targeted
ing the most expensive to acquire, led to the best fore, there is room to explore the performance
0.50 GPT3.5-Turbo
performance in four models (Falcon, FlanT5, Mis- of models specifically finetuned for proportional
GPT3.5-Turbo
GPT3.5-Turbo
tral and GPT-3.5-Turbo) including the peak perfor- analogy completion. 0.40 Our study currently evaluate
FlanT5
mance (55% EMA) from GPT-3.5-Turbo. In con- manual prompting techniques which are known
GPT3.5-Turbo to
GPT3.5-Turbo
Accuracy
0.30
trast, structured knowledge, the second most costly, be brittle. We believe that evaluating automatic
0.20
did not result in any model’s best performance. Al- prompting approaches for the same task would be
though exemplar knowledge is the least expensive, compelling. 0.10 CodeParrot CodeParro CodeParrot
CodeParrot CodeParrot
CodeParrot
hot Five-shot SKP [Random]
three modelsSKP [Semantic]
performed TKP best with it (Zephyr and t/Code-T5
0.00
23 0.23 0.25 0.25 0.25 7 Limitations
SKP[Semantic]
Zero-shot
One-shot
TKP
Five-shot
SKP[Random]
0.4 0.38 CodeParrot0.14 in Five-shot;

0.15 CodeT5
0.44 in One-shot).
22 0.07 0.06 0.06 0.22
26 0.27 0.25 0.24 0.27 When incorporating knowledge paths in SKP, the
23 0.14 6 Conclusion and Future Work
0.18 0.19 0.24
34 0.36 0.16 0.16 exact semantic relationship between the question
Best Performing Model
Least Performing Model
24 0 We explore the ability of nine LLMs to solve 0.22 term pair may sometimes be retrieved from the
0.1 0 0.02 Prompting Styles
32 0.41 proportional
0.38 analogies
0.3879 via different knowledge- 0.55 knowledge sources. We define this specific infor-
enhanced prompting techniques on a 15K MCQ mation as targeted knowledge. As a result, these
dataset that we introduce. Our extensive experi- instances can be categorized as SKP with targeted
ments show that most LLMs perform their best knowledge. However, we do not currently verify
when targeted knowledge is incorporated into or adjust for them.
prompts as opposed to exemplar knowledge and In this work, we employed manual prompt cre-
8
ation, where slight variations can lead to significant Adrian Boteanu and Sonia Chernova. 2015. Solving
differences in model outputs (Zhao et al., 2021). and explaining analogy questions using semantic net-
works. In Proceedings of the AAAI Conference on
We did not account for this variability in the cur-
Artificial Intelligence, volume 29.
rent study by experimenting with several prompt
templates for each prompting technique. Tom B Brown. 2020. Language models are few-shot
learners. Annual Conference on Neural Information
Processing Systems.
References William R Brown. 1989. Two traditions of analogy.
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Informal Logic, 11(3).
Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Jiangjie Chen, Rui Xu, Ziquan Fu, Wei Shi, Zhongqiao
Diogo Almeida, Janko Altenschmidt, Sam Altman, Li, Xinbo Zhang, Changzhi Sun, Lei Li, Yanghua
Shyamal Anadkat, et al. 2023. Gpt-4 technical report. Xiao, and Hao Zhou. 2022. E-kar: A benchmark for
arXiv preprint arXiv:2303.08774. rationalizing natural language analogical reasoning.
Stergos Afantenos, Tarek Kunze, Suryani Lim, Henri In Findings of the Association for Computational
Prade, and Gilles Richard. 2021. Analogies between Linguistics: ACL 2022, pages 3941–3955.
sentences: Theoretical aspects-preliminary experi- Hyung Won Chung, Le Hou, Shayne Longpre, Barret
ments. In Symbolic and Quantitative Approaches to Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang,
Reasoning with Uncertainty: 16th European Con- Mostafa Dehghani, Siddhartha Brahma, Albert Web-
ference, ECSQARU 2021, Prague, Czech Republic, son, Shixiang Shane Gu, Zhuyun Dai, Mirac Suz-
September 21–24, 2021, Proceedings 16, pages 3–18. gun, Xinyun Chen, Aakanksha Chowdhery, Sharan
Springer. Narang, Gaurav Mishra, Adams Yu, Vincent Zhao,
Yanping Huang, Andrew Dai, Hongkun Yu, Slav
Carl Allen and Timothy Hospedales. 2019. Analogies
Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam
explained: Towards understanding word embeddings.
Roberts, Denny Zhou, Quoc V. Le, and Jason Wei.
In International Conference on Machine Learning,
2022. Scaling instruction-finetuned language models.
pages 223–231. PMLR.
arXiv preprint.
Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Al-
Mingkai Deng, Jianyu Wang, Cheng-Ping Hsieh, Yi-
shamsi, Alessandro Cappelli, Ruxandra Cojocaru,
han Wang, Han Guo, Tianmin Shu, Meng Song, Eric
Mérouane Debbah, Étienne Goffinet, Daniel Hess- Xing, and Zhiting Hu. 2022. RLPrompt: Optimizing
low, Julien Launay, Quentin Malartic, et al. 2023. discrete text prompts with reinforcement learning.
The falcon series of open language models. arXiv In Proceedings of the 2022 Conference on Empiri-
preprint arXiv:2311.16867. cal Methods in Natural Language Processing, pages
Yuvanesh Anand, Zach Nussbaum, Brandon Duder- 3369–3391, Abu Dhabi, United Arab Emirates. As-
stadt, Benjamin Schmidt, and Andriy Mulyar. 2023. sociation for Computational Linguistics.
Gpt4all: Training an assistant-style chatbot with large Yujuan Ding, Wenqi Fan, Liangbo Ning, Shijie Wang,
scale data distillation from gpt-3.5-turbo. https: Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li.
//github.com/nomic-ai/gpt4all. 2024. A survey on rag meets llms: Towards retrieval-
Simran Arora, Avanika Narayan, Mayee F Chen, Lau- augmented large language models. arXiv preprint
rel Orr, Neel Guha, Kush Bhatia, Ines Chami, and arXiv:2405.06211.
Christopher Re. 2022. Ask me anything: A sim- Aleksandr Drozd, Anna Gladkova, and Satoshi Mat-
ple strategy for prompting language models. In The suoka. 2016. Word embeddings, analogies, and ma-
Eleventh International Conference on Learning Rep- chine learning: Beyond king-man+ woman= queen.
resentations. In Proceedings of coling 2016, the 26th international
Jinheon Baek, Alham Fikri Aji, and Amir Saffari. 2023. conference on computational linguistics: Technical
Knowledge-augmented language model prompting papers, pages 3519–3530.
for zero-shot knowledge graph question answering. Dedre Gentner. 1983. Structure-mapping: A theoretical
In Proceedings of the 1st Workshop on Natural framework for analogy. Cognitive science, 7(2):155–
Language Reasoning and Structured Explanations 170.
(NLRSE), pages 78–106, Toronto, Canada. Associa-
tion for Computational Linguistics. Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah,
Muhammad Irfan, Anas Zafar, Muhammad Bilal
Bhavya Bhavya, Shradha Sehgal, Jinjun Xiong, and Shaikh, Naveed Akhtar, Jia Wu, Seyedali Mirjalili,
ChengXiang Zhai. 2024. AnaDE1.0: A novel data et al. 2023. Large language models: a comprehensive
set for benchmarking analogy detection and extrac- survey of its applications, challenges, limitations, and
tion. In Proceedings of the 18th Conference of the future prospects. Authorea Preprints.
European Chapter of the Association for Computa-
tional Linguistics (Volume 1: Long Papers), pages Douglas R Hofstadter. 2001. Analogy as the core of
1723–1737, St. Julian’s, Malta. Association for Com- cognition. The analogical mind: Perspectives from
putational Linguistics. cognitive science, pages 499–538.
9
K Holyoak, Dedre Gentner, and B Kokinov. 2001. The Chin-Yew Lin. 2004. Rouge: A package for automatic
place of analogy in cognition. The analogical mind: evaluation of summaries. In Text summarization
Perspectives from cognitive science, 119. branches out, pages 74–81.
Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan,
Allamanis, and Marc Brockschmidt. 2019. Code- Lawrence Carin, and Weizhu Chen. 2022. What
searchnet challenge: Evaluating the state of semantic makes good in-context examples for GPT-3? In
code search. arXiv preprint arXiv:1909.09436. Proceedings of Deep Learning Inside Out (DeeLIO
2022): The 3rd Workshop on Knowledge Extrac-
Royal Jain. 2023. codeparrot (CodeParrot). tion and Integration for Deep Learning Architectures,
Albert Q Jiang, Alexandre Sablayrolles, Arthur Men- pages 100–114, Dublin, Ireland and Online. Associa-
sch, Chris Bamford, Devendra Singh Chaplot, Diego tion for Computational Linguistics.
de las Casas, Florian Bressand, Gianna Lengyel, Guil-
laume Lample, Lucile Saulnier, et al. 2023. Mistral Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paran-
7b. arXiv preprint arXiv:2310.06825. jape, Michele Bevilacqua, Fabio Petroni, and Percy
Liang. 2024. Lost in the middle: How language mod-
Cheng Jiayang, Lin Qiu, Tsz Chan, Tianqing Fang, els use long contexts. Transactions of the Association
Weiqi Wang, Chunkit Chan, Dongyu Ru, Qipeng for Computational Linguistics, 12:157–173.
Guo, Hongming Zhang, Yangqiu Song, et al. 2023.
Storyanalogy: Deriving story-level analogies from Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang,
large language models to unlock analogical under- Hiroaki Hayashi, and Graham Neubig. 2023. Pre-
standing. In Proceedings of the 2023 Conference on train, prompt, and predict: A systematic survey of
Empirical Methods in Natural Language Processing, prompting methods in natural language processing.
pages 11518–11537. ACM Computing Surveys, 55(9):1–35.
Jungo Kasai, Keisuke Sakaguchi, yoichi takahashi, Shayne Longpre, Le Hou, Tu Vu, Albert Webson,
Ronan Le Bras, Akari Asai, Xinyan Velocity Yu, Hyung Won Chung, Yi Tay, Denny Zhou, Quoc V Le,
Dragomir Radev, Noah A. Smith, Yejin Choi, and Barret Zoph, Jason Wei, and Adam Roberts. 2023.
Kentaro Inui. 2023. Realtime QA: What’s the an- The flan collection: Designing data and methods for
swer right now? In Thirty-seventh Conference on effective instruction tuning. In Proceedings of the
Neural Information Processing Systems Datasets and 40th International Conference on Machine Learning,
Benchmarks Track. volume 202 of Proceedings of Machine Learning
Research, pages 22631–22648. PMLR.
Kiana Kheiri and Hamid Karimi. 2023. Sentimentgpt:
Exploiting gpt for advanced sentiment analysis and Pankaj Mathur. 2023. orca_mini_7b: An ex-
its departure from current machine learning. arXiv plain tuned openllama-7b model on cus-
preprint arXiv:2307.10234. tom wizardlm, alpaca, and dolly datasets.
https://github.com/pankajarm/wizardlm_
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu- alpaca_dolly_orca_open_llama_7b,
taka Matsuo, and Yusuke Iwasawa. 2022. Large lan- https://https://huggingface.co/psmathur/
guage models are zero-shot reasoners. Advances in wizardlm_alpaca_dolly_orca_open_llama_7b.
neural information processing systems, 35:22199–
22213. John P. McCrae, Alexandre Rademaker, Francis Bond,
Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Ewa Rudnicka, and Christiane Fellbaum. 2019. En-
Savarese, and Steven Hoi. 2022. CodeRL: Master- glish WordNet 2019 – an open-source WordNet for
ing code generation through pretrained models and English. In Proceedings of the 10th Global Word-
deep reinforcement learning. In Advances in Neural net Conference, pages 245–252, Wroclaw, Poland.
Information Processing Systems. Global Wordnet Association.
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Grégoire Mialon, Roberto Dessì, Maria Lomeli, Christo-
Petroni, Vladimir Karpukhin, Naman Goyal, Hein- foros Nalmpantis, Ram Pasunuru, Roberta Raileanu,
rich Küttler, Mike Lewis, Wen-tau Yih, Tim Rock- Baptiste Rozière, Timo Schick, Jane Dwivedi-Yu,
täschel, et al. 2020. Retrieval-augmented generation Asli Celikyilmaz, et al. 2023. Augmented language
for knowledge-intensive nlp tasks. Advances in Neu- models: a survey. arXiv preprint arXiv:2302.07842.
ral Information Processing Systems, 33:9459–9474.
Tomáš Mikolov, Wen-tau Yih, and Geoffrey Zweig.
Bill Yuchen Lin, Xinyue Chen, Jamin Chen, and Xiang 2013. Linguistic regularities in continuous space
Ren. 2019. KagNet: Knowledge-aware graph net- word representations. In Proceedings of the 2013
works for commonsense reasoning. In Proceedings conference of the north american chapter of the as-
of the 2019 Conference on Empirical Methods in Nat- sociation for computational linguistics: Human lan-
ural Language Processing and the 9th International guage technologies, pages 746–751.
Joint Conference on Natural Language Processing
(EMNLP-IJCNLP), pages 2829–2839, Hong Kong, Marvin Minsky. 1988. Society of mind. Simon and
China. Association for Computational Linguistics. Schuster.
10
Subhabrata Mukherjee, Arindam Mitra, Ganesh Jawa- Sander Schulhoff, Michael Ilie, Nishant Balepur, Kon-
har, Sahaj Agarwal, Hamid Palangi, and Ahmed stantine Kahadze, Amanda Liu, Chenglei Si, Yin-
Awadallah. 2023. Orca: Progressive learning from heng Li, Aayush Gupta, HyoJung Han, Sevien Schul-
complex explanation traces of gpt-4. arXiv preprint hoff, et al. 2024. The prompt report: A system-
arXiv:2306.02707. atic survey of prompting techniques. arXiv preprint
arXiv:2406.06608.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-
Jing Zhu. 2002. Bleu: a method for automatic evalu- Robyn Speer, Joshua Chin, and Catherine Havasi. 2017.
ation of machine translation. In Proceedings of the Conceptnet 5.5: An open multilingual graph of gen-
40th annual meeting of the Association for Computa- eral knowledge. In Proceedings of the AAAI confer-
tional Linguistics, pages 311–318. ence on artificial intelligence, volume 31.
Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Oren Sultan, Yonatan Bitton, Ron Yosef, and Dafna
Ruxandra Cojocaru, Alessandro Cappelli, Hamza Shahaf. 2024. Parallelparc: A scalable pipeline for
Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, generating natural-language analogies. In Proceed-
and Julien Launay. 2023. The RefinedWeb dataset ings of the 2024 Conference of the North American
for Falcon LLM: outperforming curated corpora Chapter of the Association for Computational Lin-
with web data, and web data only. arXiv preprint guistics: Human Language Technologies (Volume 1:
arXiv:2306.01116. Long Papers), pages 5900–5924.
Alec Radford, Jeff Wu, Rewon Child, David Luan, Oren Sultan and Dafna Shahaf. 2022. Life is a circus
Dario Amodei, and Ilya Sutskever. 2019a. Language and we are the clowns: Automatically finding analo-
models are unsupervised multitask learners. gies between situations and processes. In Proceed-
ings of the 2022 Conference on Empirical Methods
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, in Natural Language Processing, pages 3547–3562.
Dario Amodei, Ilya Sutskever, et al. 2019b. Lan-
guage models are unsupervised multitask learners. Terrence Szymanski. 2017. Temporal word analogies:
OpenAI blog, 1(8):9. Identifying lexical replacement with diachronic word
embeddings. In Proceedings of the 55th annual meet-
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine ing of the association for computational linguistics
Lee, Sharan Narang, Michael Matena, Yanqi Zhou, (volume 2: short papers), pages 448–453.
Wei Li, and Peter J. Liu. 2020a. Exploring the limits
of transfer learning with a unified text-to-text trans- Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann
former. J. Mach. Learn. Res., 21(1). Dubois, Xuechen Li, Carlos Guestrin, Percy Liang,
and Tatsunori B. Hashimoto. 2023. Stanford alpaca:
Colin Raffel, Noam Shazeer, Adam Roberts, Kather- An instruction-following llama model. https://
ine Lee, Sharan Narang, Michael Matena, Yanqi github.com/tatsu-lab/stanford_alpaca.
Zhou, Wei Li, and Peter J Liu. 2020b. Exploring the
limits of transfer learning with a unified text-to-text Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
transformer. Journal of machine learning research, Martinet, Marie-Anne Lachaux, Timothée Lacroix,
21(140):1–67. Baptiste Rozière, Naman Goyal, Eric Hambro,
Faisal Azhar, et al. 2023. Llama: Open and effi-
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and cient foundation language models. arXiv preprint
Percy Liang. 2016. Squad: 100,000+ questions arXiv:2302.13971.
for machine comprehension of text. arXiv preprint
arXiv:1606.05250. Lewis Tunstall, Edward Beeching, Nathan Lambert,
Nazneen Rajani, Kashif Rasul, Younes Belkada,
Laria Reynolds and Kyle McDonell. 2021. Prompt pro- Shengyi Huang, Leandro von Werra, Clémentine
gramming for large language models: Beyond the Fourrier, Nathan Habib, et al. 2023. Zephyr: Di-
few-shot paradigm. In Extended Abstracts of the rect distillation of lm alignment. arXiv preprint
2021 CHI Conference on Human Factors in Com- arXiv:2310.16944.
puting Systems, CHI EA ’21, New York, NY, USA.
Association for Computing Machinery. Peter D Turney. 2005. Measuring semantic similarity by
latent relational analysis. arXiv preprint cs/0508053.
Adam Roberts, Colin Raffel, and Noam Shazeer. 2020.
How much knowledge can you pack into the param- Peter D Turney and Michael L Littman. 2003. Combin-
eters of a language model? In Proceedings of the ing independent modules in lexical multiple-choice
2020 Conference on Empirical Methods in Natural problems. In Recent Advances in Natural Language
Language Processing (EMNLP), pages 5418–5426, Processing III, page 101–110.
Online. Association for Computational Linguistics.
Asahi Ushio, Luis Espinosa Anke, Steven Schockaert,
Joshua Robinson, Christopher Michael Rytting, and and Jose Camacho-Collados. 2021. Bert is to nlp
David Wingate. 2023. Leveraging large language what alexnet is to cv: Can pre-trained language mod-
models for multiple choice question answering. els identify analogies? In Proceedings of the 59th
Preprint, arXiv:2210.12353. Annual Meeting of the Association for Computational
11
Linguistics and the 11th International Joint Confer- Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten
ence on Natural Language Processing (Volume 1: Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou,
Long Papers), pages 3609–3624. et al. 2022. Chain-of-thought prompting elicits rea-
soning in large language models. Advances in neural
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob information processing systems, 35:24824–24837.
Uszkoreit, Llion Jones, and Aidan N Gomez. 2017a.
L. u. kaiser, and i. polosukhin,“attention is all you Thilini Wijesiriwardene, Ruwan Wickramarachchi, Bi-
need,”. Advances in neural information processing mal Gajera, Shreeyash Gowaikar, Chandan Gupta,
systems, 30:5998–6008. Aman Chadha, Aishwarya Naresh Reganti, Amit
Sheth, and Amitava Das. 2023. ANALOGICAL -
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob a novel benchmark for long text analogy evaluation
Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz in large language models. In Findings of the Asso-
Kaiser, and Illia Polosukhin. 2017b. Attention is ciation for Computational Linguistics: ACL 2023,
all you need. In Proceedings of the 31st Interna- pages 3534–3549, Toronto, Canada. Association for
tional Conference on Neural Information Processing Computational Linguistics.
Systems, NIPS’17, page 6000–6010, Red Hook, NY,
USA. Curran Associates Inc. Sam Witteveen and Martin Andrews. 2019. Paraphras-
ing with large language models. In Proceedings of
Denny Vrandečić and Markus Krötzsch. 2014. Wiki-
the 3rd Workshop on Neural Generation and Trans-
data: a free collaborative knowledgebase. Communi-
lation, pages 215–220, Hong Kong. Association for
cations of the ACM, 57(10):78–85.
Computational Linguistics.
Ekaterina Vylomova, Laura Rimell, Trevor Cohn, and
Timothy Baldwin. 2016. Take and took, gaggle and Qinyuan Ye, Maxamed Axmed, Reid Pryzant, and
goose, book and read: Evaluating the utility of vector Fereshte Khani. 2023. Prompt engineering a prompt
differences for lexical relation learning. In Proceed- engineer. arXiv preprint arXiv:2311.05661.
ings of the 54th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers), Wenhao Yu, Chenguang Zhu, Zaitang Li, Zhiting Hu,
pages 1671–1682, Berlin, Germany. Association for Qingyun Wang, Heng Ji, and Meng Jiang. 2022. A
Computational Linguistics. survey of knowledge-enhanced text generation. ACM
Computing Surveys, 54(11s):1–38.
Chenhao Wang, Yubo Chen, Zhipeng Xue, Yang Zhou,
and Jun Zhao. 2021a. Cognet: Bridging linguis- Siyu Yuan, Jiangjie Chen, Changzhi Sun, Jiaqing Liang,
tic knowledge, world knowledge and commonsense Yanghua Xiao, and Deqing Yang. 2023. Analogykb:
knowledge. In Proceedings of the AAAI Conference Unlocking analogical reasoning of language models
on Artificial Intelligence, volume 35, pages 16114– with a million-scale knowledge base. arXiv preprint
16116. arXiv:2305.05994.
Liyan Wang and Yves Lepage. 2020. Vector-to- Yadong Zhang, Shaoguang Mao, Tao Ge, Xun Wang,
sequence models for sentence analogies. In 2020 Adrian de Wynter, Yan Xia, Wenshan Wu, Ting Song,
International Conference on Advanced Computer Man Lan, and Furu Wei. 2024. Llm as a mastermind:
Science and Information Systems (ICACSIS), pages A survey of strategic reasoning with large language
441–446. IEEE. models. arXiv preprint arXiv:2404.01230.
Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and
Zhang, Zhiyuan Liu, Juanzi Li, and Jian Tang. 2021b. Sameer Singh. 2021. Calibrate before use: Improv-
Kepler: A unified model for knowledge embedding ing few-shot performance of language models. In
and pre-trained language representation. Transac- International conference on machine learning, pages
tions of the Association for Computational Linguis- 12697–12706. PMLR.
tics, 9:176–194.
Xunjie Zhu and Gerard de Melo. 2020. Sentence analo-
Yu Wang, Nedim Lipka, Ryan A Rossi, Alexa Siu, Ruiyi gies: Linguistic regularities in sentence embeddings.
Zhang, and Tyler Derr. 2024. Knowledge graph In Proceedings of the 28th International Conference
prompting for multi-document question answering. on Computational Linguistics, pages 3389–3400,
In Proceedings of the AAAI Conference on Artificial Barcelona, Spain (Online). International Committee
Intelligence, volume 38, pages 19206–19214. on Computational Linguistics.
Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Zeyuan Allen Zhu and Yuanzhi Li. 2023. Physics of
Hoi. 2021c. Codet5: Identifier-aware unified language models: Part 3.1, knowledge storage and
pre-trained encoder-decoder models for code un- extraction. arXiv preprint arXiv:2309.14316.
derstanding and generation. arXiv preprint
arXiv:2109.00859.
A Model Details
Taylor Webb, Keith J Holyoak, and Hongjing Lu. 2023.
Emergent analogical reasoning in large language Falcon (Almazrouei et al., 2023): The Falcon
models. Nature Human Behaviour, 7(9):1526–1541. model used in this work is the Falcon-7B-Instruct
12
7.9 7.2
anT5 7.44 15.96 18.51 12.19

ca 14.62 9.52 9.28 22.24
7.62 12.72 12.96 11.14
model 13 which is a causal decode-only model, Zephyr18 : We use the Zephyr-7B-alpha with
instruction finetuned on top of the base Falcon-7B. 7B parameters finetuned from Mistral-7B-v0.1.
The fine-tuning dataset is made up of 250M tokens The finetune datasets contain synthetic dialogues
from various conversational datasets (Baize14 ), ranked by GPT-4 and a prompt completion dataset
instruction datasets (GPT4All (Anand et al., where completions are ranked by GPT-4.
2023), GPTeacher15 ) and common crawl data
(RefinedWeb (Penedo et al., 2023))from the web. CodeT5 (Le et al., 2022): The CodeT5 model
Falcon-7B tokenizer is used for tokenization. The wwe use is codet5-large model with 770M
architecture of Falcon is broadly adapted from parameters. The model is trained on Masked Span
Prompt Types
Five-shot GPT3 with changes in positional
Structured embeddings
Knowledge (semantic)Targeted used,
Knowledge Prediction objective on CodeSearchNet dataset
844.46 282.918
attention mechanisms used and decoder block 175.68 (Husain et al., 2019)
22.11111111 17.86777778 26.40888889
1 architecture. 0.235581269 0.089599782
0.496812801 0 1
CodeParrot (Jain, 2023): We use the 1.5B
FlanT5 (Chung et al., 2022): We use the parameter CodeParrot model based on GPT-2. The
FlanT5-XXL version with 11B parameters. This model is trained to generate python code on a
version is based on a pretrained
Prompt T5
Types(Raffel et al., python files dataset from GitHub19 .
Zero-shot One-shot Five-shot Structured Knowledge (semantic)Targeted Knowledge
0
2020b) and instruction finetuned0.22 on a mixture of 1 0.24 0.09
0.90 tasks. This model is finetuned specifically
0.97 with
0.50 GPT-3.5-Turbo020 : We use OpenAI 1 API to access
Chain-of-Thought data. the model, gpt-3.5-turbo-0125.
ormalized Normalized Averaged Model
veraged Prompt Peak Model Performance Performance
0
GPT2 (Radford0.457
et al., 2019b):
0.90
We use the
Normalized Averaged Prompt Length
0.22 XL version0.4009
with 1.5 parameters.0.97The model is Peak Model Performance
1 0.4121 0.50 1
0.24
pretrained 0.3879
with English language data 0
(40 GB of
0.09 text from the web) and causal language
0.5525 1 modeling 0.8
objective. Interestingly the model is not trained on
0.6
N knowledge only articles from Wikipedia.
25.04
30.7 0.4
1 Mistral (Jiang et al., 2023): This is a decoder
23.34 0.2
9.28 only transformer model and we use the Mistral-7B-
9.75 Instruct version with 7B parameters. This version
22.7
0
0 is finetuned on publicly available instruction Zero-shot One-shot Five-shot Structured Targeted
7.9 Knowledge Knowledge
14.41
datasets. Mistral introduce Sliding Window (semantic)
Attention, Rolling Buffer Cache and Pre-fill

N knowledge only Chunking in its architecture. Figure 5: Prompt Lengths vs. Peak Performance
13.53
Orca (Mathur, 2023): We employ orca_mini_7b,

a 7B parameter version of Orca, which is based on B Performance and Additional Results
OpenLLaMA-7B. The model is trained on datasets
B.1 Model Performance vs. Prompt Length
with explanation tuning, where the response from
(PL)
the <query, response> pair is augmented with
detailed responses from the base (teacher) model We calculated the average prompt lengths across
WD knowledge only
(Mukherjee et al., 2023). The explanation tuning models for each prompting technique (PL for
CN knowledge only SKP is calculated by averaging SKP[random] and
WN knowledge only
datasets used are WizardLM16 , Alpaca dataset
(Taori et al., 2023) and Dolly17 and system prompts SKP[semantic]) (See Figure 5). According to (Liu
are used to elicit step-by-step explanations. et al., 2024), longer prompts (with important in-
formation placed in the middle) tend to negatively
13
affect performance. Based on such literature, one
y https://huggingface.co/tiiuae/
falcon-7b-instruct 18
https://huggingface.co/HuggingFaceH4/
14
https://github.com/project-baize/ zephyr-7b-alpha
baize-chatbot/tree/main/data 19
https://huggingface.co/datasets/codeparrot/
15
https://github.com/teknium1/GPTeacher codeparrot-clean
16 20
https://github.com/nlpxucan/WizardLM https://platform.openai.com/docs/models/
17
https://github.com/databrickslabs/dolly gpt-3-5-turbo
13
Model Name WD knowledge only CN knowledge only WN knowledge only All knowledge available
Falcon 25.14 25.38 25.04 24.14
FlanT5 19.63 28.15 30.7 12.19
GPT2 4.11 2.4 1 5.84
Mistral 22.84 22.07 23.34 24.11
Orca 14.62 9.52 9.28 22.24
Zephyr 16.16 12.49 9.75 14.76
CodeT5 21.47 21.64 22.7 20.7
CodeParrot 0 0 0 0
GPT-3.5-Turbo 7.13 7.43 7.9 7.2
Table 3: Performance of models based on provided knowledge types. Performance values are reported in EMA
percentage and calculated using 2995 (∼20%) data points that had all three knowledge types available.
might suggest that Zero-shot prompts yield better

results in our study because they are short, but this
is not the case. Despite being longer than Zero-
shot prompts, a higher peak model performance is
achieved by TKP.
C Prompts
Figues 6, 7, 8, 9 and 10 illustrates example prompts
provided to models.
14
Zero-shot Prompt
Question: What is the analogical word pair to, "Lens" and "Glass" from the following choices.
Choices:
1. "Well" and "Water"
2. "Saw" and "Wood"
3. "Sweater" and "Wool"
4. "Fuel" and "Fire"
5. "Ink" and "Paper"
The answer should only be 1 or 2 or 3 or 4 or 5?.
Answer:
Figure 6: Example of a Zero-shot prompt used on our dataset
One-shot Prompt
Look at the following example and answer the question below.
Example:
Question: What is the analogical word pair to, "Cloth" and "Threads" from the
following choices. Choices:
1. "Gun" and "Bullets"
2. "Guitar" and "Drums"
3. "Chain" and "Links"
4. "Star" and "Planets"
Answer: 3
Question: What is the analogical word pair to, "Lens" and "Glass" from the following
choices. Choices:
2. "Saw" and "Wood"
Answer:
Figure 7: Example of a One-shot prompt used on our dataset
15
Five-shot Prompt
Look at the following examples and answer the question below.
Example 1:
Question: What is the analogical word pair to, "Cloth" and "Threads" from the following
choices. Choices:
1. "Gun" and "Bullets"
2. "Guitar" and "Drums"
3. "Chain" and "Links"
4. "Star" and "Planets"
Answer: 3
Example 2:
Question: What is the analogical word pair to, "Drapery" and "Fabric" from the following
choices. Choices:
1. "Fireplace" and "Wood"
2. "Curtain" and "Stage"
3. "Shutter" and "Light"
4. "Sieve" and "Liquid"
5. "Window" and "Glass"
Answer: 5
Example 3:
......................
Example 4:
......................
Example 5:
......................
Question: What is the analogical word pair to, "Lens" and "Glass" from the following choices. Choices:
2. "Saw" and "Wood"
Answer:
Figure 8: Example of a Five-shot prompt used on our dataset
Structured Knowledge Prompt
Question: What is the analogical word pair to, "Lens" and "Glass" from the following
choices. Choices:
2. "Saw" and "Wood"
Use following knowledge to find the correct answer choice.
Question Knowledge: glass related to device, device is a camera, camera is a lens; optical device is
a device, device is a instrumentality, instrumentality is a container, container is a glass
Answer choice 1 knowledge: well is a place, place is a water_route, water_route is a water
Answer choice 2 knowledge: wood made of house, house made of metal, metal made of saw; power tool is
a machine, machine is a device, device is a instrument, instrument is a wind, wind is a wood
Answer choice 3 knowledge: coat made of wool, wool related to warm, warm related to sweater; garment
is a habiliment, habiliment is a covering, covering is a artifact, artifact is a textile, textile is
a wool
Answer choice 4 knowledge: fuel is a substance, substance is a element, element is a fire
Answer choice 5 knowledge: ink at location sign, sign at location paper; ink is a change, change is a
cover, cover is a paper
Answer:
Figure 9: Example of a Structured Knowledge Prompt[semantic] used on our dataset
16
Targeted Knowledge Prompt
Question: What is the analogical word pair to, "Lens" and "Glass" from the following choices. Choices:
2. "Saw" and "Wood"
The implicit relation shared by "lens" and "glass" is "made of". The correct choice should have the same implicit
relation among the two words.
Answer:
Figure 10: Example of a Targeted Knowledge Prompt used on our dataset
17

2412.00869v1

Uploaded by

Copyright:

Available Formats

2412.00869v1

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2412.00869v1

Uploaded by

Copyright:

Available Formats

Exploring the Abilities of Large Language Models to Solve

Proportional Analogies via Knowledge-Enhanced Prompting

Thilini Wijesiriwardene1 , Ruwan Wickramarachchi1 , Sreeram Vennam2 , Vinija Jain4,5 * ,

Abstract 2022; Ushio et al., 2021; Szymanski, 2017; Drozd

terms, are often used to assess linguistic and

No Knowledge Zero-shot Prompting

Targeted Knowledge Prompting

Figure 1: Knowledge-enhanced Prompting. An illustration of our knowledge-enhanced prompting approach with

"Term 1 semantically 4 Experimental Setting

Semantic Similarity We have conducted a comprehensive set of experi-

5.4 Cost of Knowledge Acquisition vs. Model No

0.4 0.38 CodeParrot0.14 in Five-shot;

anT5 7.44 15.96 18.51 12.19

Attention, Rolling Buffer Cache and Pre-fill

Orca (Mathur, 2023): We employ orca_mini_7b,

might suggest that Zero-shot prompts yield better

Figure 6: Example of a Zero-shot prompt used on our dataset

Look at the following example and answer the question below.

Figure 7: Example of a One-shot prompt used on our dataset

Look at the following examples and answer the question below.

Figure 8: Example of a Five-shot prompt used on our dataset

Structured Knowledge Prompt

Figure 9: Example of a Structured Knowledge Prompt[semantic] used on our dataset

Figure 10: Example of a Targeted Knowledge Prompt used on our dataset

You might also like