KnowPath - Reasoning Via LLM-generated Inference Paths
Abstract
1 Introduction
Large language models (LLMs) are increasingly being applied to various Natural Language Processing (NLP) tasks, such as text generation [Wang et al., 2024; Dong et al., 2023], knowledge-based question answering [Luo et al., 2024a; Zhao et al., 2024], and applications over specific domains [Alberts et al., 2023; Jung et al., 2024]. In most scenarios, LLMs serve as intermediary agents for implementing various functions [Hu et al., 2024; Huang et al., 2024; Guo et al., 2024]. However, due to the characteristics of generative models, LLMs still suffer from hallucination issues, often generating incorrect answers that can lead to uncontrollable and severe consequences [Li et al., 2024]. Introducing knowledge graphs (KGs) to mitigate this phenomenon is promising [Yin et al., 2022], because knowledge graphs store a large amount of structured factual knowledge that can provide large models with accurate knowledge dependencies. At the same time, correcting the knowledge stored in large models often requires fine-tuning their parameters, which inevitably incurs high computational costs [Sun et al., 2024]. In contrast, updating knowledge graphs is relatively simple and incurs minimal overhead.

The paradigms for combining LLMs with KGs can be classified into three main categories. The first is knowledge injection during pre-training or fine-tuning [Luo et al., 2024b; Cao et al., 2023; Jiang et al., 2022; Yang et al., 2024]. While the model's ability to grasp knowledge improves, these methods introduce high computational costs and catastrophic forgetting. The second entails using LLMs as agents that reason over knowledge retrieved from the KGs. This approach does not require fine-tuning or retraining, significantly reducing overhead [Jiang et al., 2023; Yang et al., 2023]. However, it heavily relies on the completeness of external KGs and underutilizes the internal knowledge of the LLMs. The third enables LLMs to participate in the process of knowledge exploration within external KGs [Ma et al., 2024]. In this case, the LLMs can engage in the selection of knowledge nodes at each step [Sun et al., 2024; Chen et al., 2024; Xu et al., 2024], thereby leveraging the advantages of the internal knowledge of the LLMs to some extent.
Figure 2: The workflow of KnowPath. It contains: (a) Inference Paths Generation to exploit the internal knowledge of LLMs, (b) Subgraph
Exploration to generate a trustworthy directed subgraph, (c) Evaluation-based Answering to integrate internal and external knowledge.
The effective patterns for introducing KGs into LLMs still have limitations. 1) Insufficient exploration of the internal knowledge of LLMs. When exploring KGs, most approaches treat LLMs merely as agents that select relevant relations and entities, overlooking the potential of their internal knowledge. 2) Constrained generation of trustworthy reasoning paths. Some methods attempt to generate highly interpretable reasoning paths, but they limit the scale of path exploration and require additional memory, and the generated paths lack intuitive visual interpretability. 3) Ambiguous fusion of internal and external knowledge. How to better integrate the internal knowledge of LLMs with the external knowledge in KGs still requires further exploration.

To overcome the above limitations, we propose KnowPath, a knowledge-enhanced large model framework driven by the collaboration of internal and external knowledge. Specifically, KnowPath consists of three stages. 1) Inference paths generation. To fully exploit the internal knowledge of LLMs and remain effective in zero-shot scenarios, this stage employs a prompt-driven approach to extract the knowledge triples most relevant to the topic entities, and then generates reasoning paths based on these triples to attempt an answer to the question. 2) Trustworthy directed subgraph exploration. The LLM combines the previously generated knowledge reasoning paths to select entities and relations, and then responds based on the subgraph formed by these selections. This stage enables the LLM to fully participate in the effective construction of external knowledge, while providing a clear process for constructing subgraphs. 3) Evaluation-based answering. At this stage, external knowledge primarily guides KnowPath, while internal knowledge assists in generating the answer. Our contributions can be summarized as follows:

• We focus on a new view, emphasizing the importance of the LLMs' powerful internal knowledge in knowledge question answering, via a prompt-based internal knowledge reasoning path generation method for LLMs.
• We build a knowledge-enhanced large model framework driven by the collaboration of internal and external knowledge. It not only integrates the internal and external knowledge of the LLMs better, but also provides clearer and more trustworthy reasoning paths.
• Extensive experiments conducted on multiple knowledge question answering datasets demonstrate that our KnowPath significantly mitigates the hallucination problem in LLMs and outperforms the existing best methods.

2 KnowPath
2.1 Preliminary
Topic Entities are the main entities in a query Q, denoted e_0. Each Q contains N topic entities {e_0^1, ..., e_0^N}.
Inference Paths are a set of paths P = {p_1, ..., p_L} generated from the LLM's own knowledge, where L ∈ [1, N] is dynamically determined by the LLM agent. Each path p starts from a topic entity e_0 ∈ {e_0^1, ..., e_0^N} and can be represented as p = e_0 → r_1 → e_1 → ... → r_n → e_n, where e_i and r_i denote entities and relations, respectively.
Knowledge Graph (KG) is composed of structured knowledge triples: K = {(e_h, r, e_t) | r ∈ R, e_h, e_t ∈ E}, where E is the set of all entities in the knowledge graph, R is the set of all relations, and e_h and e_t are the head and tail entities, respectively.
KG Subgraph refers to a connected subgraph K_s extracted from the knowledge graph K whose entities and relations are entirely derived from K, i.e., K_s ⊆ K.
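To make these definitions concrete, here is a minimal sketch of the objects in Python; the type and field names are our own illustration, not part of the paper (the example entities come from Figure 2):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    """A knowledge triple (e_h, r, e_t) drawn from the KG."""
    head: str
    relation: str
    tail: str

@dataclass
class InferencePath:
    """An inference path p = e_0 -> r_1 -> e_1 -> ... -> r_n -> e_n."""
    start: str                    # topic entity e_0
    hops: list[tuple[str, str]]   # [(r_1, e_1), ..., (r_n, e_n)]

    def render(self) -> str:
        out = self.start
        for r, e in self.hops:
            out += f" -> {r} -> {e}"
        return out

# A KG subgraph K_s is simply a set of triples with K_s ⊆ K.
Subgraph = set[Triple]

p = InferencePath("Whistler Mountain", [("located_in", "British Columbia")])
print(p.render())  # Whistler Mountain -> located_in -> British Columbia
```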
2.2 Inference Paths Generation
Due to the extensive world knowledge stored within their parameters, LLMs can be considered a complementary representation of KGs. To fully excavate the internal knowledge of LLMs and guide the exploration of KGs, we propose a prompt-driven method that extracts the internal knowledge of LLMs effectively. It retrieves reasoning paths from the model's internal knowledge, clearly displays the reasoning process, and is particularly effective in zero-shot scenarios. Specifically, given a query Q, we first guide the LLM to extract the most relevant topic entities {e_0^1, ..., e_0^N} through a specially designed prompt. Then, based on these topic entities, the LLM is instructed to generate a set of knowledge triples associated with them; the number of triples n is variable. Finally, the LLM attempts to answer based on the previously generated knowledge triples and provides a specific reasoning path from entities and relations to the answer. Each path has the form p = e_0 → r_1 → e_1 → ... → r_n → e_n. The details of the Inference Paths Generation process are presented in Appendix A.1.
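As a rough sketch of how this stage can be wired up: the exact prompts live in the paper's Appendix A.1, so the wording below and the `chat` helper are illustrative assumptions, not the authors' prompts.

```python
import json

def generate_inference_paths(chat, question: str) -> dict:
    """Stage (a): mine the LLM's internal knowledge for reasoning paths.

    `chat(prompt) -> str` stands for any LLM completion call (e.g.,
    GPT-3.5-turbo or DeepSeek-V3 behind a thin wrapper).
    """
    # 1) Extract the topic entities {e_0^1, ..., e_0^N}.
    topics = json.loads(chat(
        "List the topic entities of the question as a JSON array of strings.\n"
        f"Question: {question}"))

    # 2) Elicit knowledge triples the model associates with those entities;
    #    the number of triples n is left to the model.
    triples = json.loads(chat(
        "Return knowledge triples [head, relation, tail] relevant to "
        f"{topics} as a JSON array of arrays.\nQuestion: {question}"))

    # 3) Attempt an answer plus explicit reasoning paths over the triples.
    result = json.loads(chat(
        "Using only these triples, answer the question and give reasoning "
        "paths of the form 'e0 -> r1 -> e1 -> ...'. Reply as JSON "
        '{"answer": "...", "paths": ["..."]}.\n'
        f"Triples: {triples}\nQuestion: {question}"))
    return {"topics": topics, "triples": triples, **result}
```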
2.3 Subgraph Exploration

Algorithm 1 Subgraph Exploration
Require: entityDict, entityName, question, maxWidth, depth, path
1: Set originalPath as path
2: if depth = 0 then
3:    Initialize path as [ ] * maxWidth
4: end if
5: for eid in entityDict do
6:    Find relevantRelations
7:    for relation in relevantRelations do
8:       Find entities linked by relation
9:    end for
10: end for
11: Extract relevantEntities using candidate entities
12: Update path and entityDict based on relevance
13: extraPath ← (path − originalPath)
14: return extraPath, entityDict
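Read as code, Algorithm 1 is a breadth-limited frontier expansion. A paraphrase in Python, with the KG lookups and the LLM pruning step abstracted into caller-supplied functions (all helper names are ours, not the paper's):

```python
def explore_subgraph(entity_dict, question, max_width, depth, paths,
                     find_relations, find_entities, select_relevant):
    """One exploration round in the shape of Algorithm 1.

    entity_dict maps frontier entity ids to names; paths is the list of
    reasoning paths built so far. find_relations/find_entities stand for
    the KG lookups and select_relevant for the LLM pruning call.
    Returns the newly added path segments and the updated frontier.
    """
    original = [list(p) for p in paths]                 # line 1
    if depth == 0:                                      # lines 2-4
        paths = [[] for _ in range(max_width)]

    candidates = []                                     # lines 5-10
    for eid in entity_dict:
        for rel in find_relations(eid):                 # single-hop relations
            for ent in find_entities(eid, rel):
                candidates.append((eid, rel, ent))

    # lines 11-12: keep only question-relevant expansions, then grow the
    # matching paths and move the frontier to the newly reached entities.
    kept = select_relevant(question, candidates)
    for i, (eid, rel, ent) in enumerate(kept[:len(paths)]):
        paths[i] = paths[i] + [entity_dict.get(eid, eid), rel, ent]
        entity_dict[ent] = ent

    extra = [p for p in paths if p not in original]     # line 13
    return extra, entity_dict                           # line 14
```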
Exploration Initialization. KnowPath performs subgraph exploration for a maximum of D rounds. Each round corresponds to an additional hop in the knowledge graph K, and the j-th round maintains N subgraphs {K_{s,j}^1, ..., K_{s,j}^N}. Each subgraph K_{s,j}^i is composed of a set of knowledge graph reasoning paths, i.e., K_{s,j}^i = p_{1,j}^i ∪ ··· ∪ p_{l,j}^i, i ∈ [1, N]. The number of reasoning paths l is flexibly determined by the LLM agent. Taking the D-th round and the z-th path as an example, exploration starts from one topic entity e_0^i and ultimately forms a connected subgraph of the KG, denoted as p_{z,D}^i = {e_0^i, r_{1,z}^i, e_{1,z}^i, r_{2,z}^i, e_{2,z}^i, ..., r_{D,z}^i, e_{D,z}^i}. At the start of the first round of subgraph exploration (D = 0), each path corresponds to the current topic entity, i.e., p_{z,0}^i = {e_0^i}.
Relation Exploration. Relation exploration aims to expand the subgraphs obtained in each round of exploration, enabling deep reasoning. Specifically, for the i-th subgraph and the j-th round of subgraph exploration, the candidate entities are denoted as E_j^i = {e_{j−1,1}^i, ..., e_{j−1,l}^i}, where e_{j−1,1}^i is the tail entity of the reasoning path p_{1,j−1}^i. Based on these candidates E_j^i, we search for all corresponding single-hop relations in the knowledge graph K, denoted as R_{a,j}^i = {r_1, ..., r_M}, where M is determined by the specific knowledge graph K. Finally, the LLM agent relies on the query Q, the inference paths P generated from the LLM's internal knowledge (Section 2.2), and all topic entities e_0 to select the most relevant candidate relations from R_{a,j}^i, denoted as R_j^i ⊆ R_{a,j}^i, whose size is dynamically determined by the LLM agent.
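A sketch of both halves of relation exploration, assuming the KG sits behind a SPARQL endpoint (the query shape, prompt wording, and helper names are illustrative assumptions, not the paper's implementation):

```python
import json

def candidate_relations(run_sparql, entity: str) -> list[str]:
    """R_{a,j}^i: all single-hop relations touching `entity`.

    run_sparql(query) -> list[dict] is any SPARQL client for the KG
    endpoint (Freebase in the paper's experiments).
    """
    query = (
        "SELECT DISTINCT ?r WHERE { "
        f"{{ <{entity}> ?r ?x . }} UNION {{ ?x ?r <{entity}> . }} }}"
    )
    return [row["r"] for row in run_sparql(query)]

def select_relations(chat, question, inference_paths, relations):
    """R_j^i ⊆ R_{a,j}^i: LLM-selected relations, guided by the
    internal-knowledge inference paths P from Section 2.2."""
    reply = chat(
        "Pick the relations useful for answering the question. "
        "Reply as a JSON array.\n"
        f"Question: {question}\nInternal paths: {inference_paths}\n"
        f"Candidates: {relations}")
    return [r for r in json.loads(reply) if r in relations]
```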
Entity Exploration. Entity exploration depends on the already determined candidate entities and candidate relations. Taking the i-th subgraph and the j-th round of subgraph exploration as an example, relying on E_j^i and R_j^i, we perform queries of the form (e, r, ?) or (?, r, e) on the knowledge graph K to retrieve the corresponding entities E_{a,j}^i = {e_1, ..., e_N}, where N varies depending on the knowledge graph K. The agent then considers the query Q, the inference paths P from Section 2.2, the topic entity e_0^i, and the candidate relation set R_j^i to select the most relevant entity set E_{j+1}^i = {e_{j,1}^i, ..., e_{j,l}^i} ⊆ E_{a,j}^i. Note that e_{j,1}^i is the tail entity of the reasoning path p_{1,j}^i.
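The (e, r, ?) and (?, r, e) lookups translate directly; the SPARQL shape is again an illustrative assumption, and the LLM selection of E_{j+1}^i follows the same pattern as the relation pruning sketched above:

```python
def candidate_entities(run_sparql, entity: str, relation: str) -> list[str]:
    """E_{a,j}^i: entities reached via (e, r, ?) or (?, r, e)."""
    query = (
        "SELECT DISTINCT ?e WHERE { "
        f"{{ <{entity}> <{relation}> ?e . }} "    # (e, r, ?)
        "UNION "
        f"{{ ?e <{relation}> <{entity}> . }} }}"  # (?, r, e)
    )
    return [row["e"] for row in run_sparql(query)]
```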
Subgraph Update. Relation exploration determines entity exploration, and we update the subgraph only after completing the entity exploration. Specifically, for the i-th subgraph and the j-th round of subgraph exploration, we append the result of the exploration (·, r, e_{j,1}^i) to the path p_{1,j}^i in the subgraph K_{s,j}^i. This path update algorithm not only considers the directionality of entities and relations, but also automatically determines and updates the paths; its detailed procedure is described in Algorithm 2. The final subgraph can be flexibly expanded thanks to the variable number of paths l.

Algorithm 2 Update Reasoning Path in Subgraph
Require: path, pathIsHead, isHead, r, e
1: if not pathIsHead then
2:    if not isHead then
3:       newPath ← path + [←, r, ←, e]
4:    else
5:       newPath ← path + [→, r, →, e]
6:    end if
7: else
8:    if not isHead then
9:       newPath ← [e, →, r, →] + path
10:   else
11:      newPath ← [e, ←, r, ←] + path
12:   end if
13: end if
14: Append newPath to path
15: return path
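Algorithm 2 translates almost line-for-line into Python. One reading of the pseudocode, with the single path and the collection it is appended to kept as separate arguments, and arrow tokens recording the edge direction:

```python
def update_reasoning_path(paths, path, path_is_head, is_head, r, e):
    """Algorithm 2: attach a directed hop (r, e) to one reasoning path.

    path_is_head: the new triple attaches at the head of `path`;
    is_head:      the known entity was the head of the matched triple.
    """
    if not path_is_head:                      # extend at the tail (lines 2-6)
        if not is_head:
            new_path = path + ["<-", r, "<-", e]
        else:
            new_path = path + ["->", r, "->", e]
    else:                                     # extend at the head (lines 8-12)
        if not is_head:
            new_path = [e, "->", r, "->"] + path
        else:
            new_path = [e, "<-", r, "<-"] + path
    paths.append(new_path)                    # line 14
    return paths                              # line 15

paths = update_reasoning_path(
    [], ["Whistler Mountain"], False, True, "located_in", "British Columbia")
# [['Whistler Mountain', '->', 'located_in', '->', 'British Columbia']]
```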
2.4 Evaluation-based Answering
After completing the subgraph update in each round, the agent attempts to answer the query through the subgraphs {K_{s,j}^1, ..., K_{s,j}^N}. If it fails to answer correctly, the next round of subgraph exploration is executed, until the maximum exploration depth D is reached. Otherwise, it outputs the final answer along with the corresponding interpretable directed subgraph. Unlike previous work [Chen et al., 2024], even if no answer is found at the maximum exploration depth, KnowPath will still rely on the inference paths P to respond. The framework of KnowPath is shown in Figure 2.
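Putting the three stages together, the answering loop of Section 2.4 looks roughly as follows; `explore_round` and `try_answer` are hypothetical stand-ins for the subgraph expansion of Section 2.3 and the LLM's self-evaluation:

```python
def answer_with_knowpath(question, inference_paths, subgraphs,
                         explore_round, try_answer, max_depth=3):
    """Evaluation-based answering (Section 2.4), paraphrased.

    explore_round grows the subgraphs by one hop (Section 2.3), and
    try_answer(question, evidence) -> (ok, answer) is the LLM's own
    check; max_depth plays the role of D (set to 3 in the experiments).
    """
    for depth in range(max_depth):
        subgraphs = explore_round(subgraphs, question, depth)
        ok, answer = try_answer(question, subgraphs)
        if ok:
            # Answer found: return it with its interpretable subgraph.
            return answer, subgraphs
    # At maximum depth with no answer, fall back to the internal
    # inference paths P rather than refusing (unlike prior work).
    _, answer = try_answer(question, inference_paths)
    return answer, inference_paths
```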
Method CWQ WebQSP Simple Questions WebQuestions
LLM only
IO prompt [Brown et al., 2020] 37.6 ± 0.8 63.3 ± 1.2 20.0 ± 0.5 48.7 ± 1.4
COT [Wei et al., 2022] 38.8 ± 1.5 62.2 ± 0.7 20.5 ± 0.4 49.1 ± 0.9
RoG w/o planning [Luo et al., 2024b] 43.0 ± 0.9 66.9 ± 1.3 - -
SC [Wang et al., 2022] 45.4 ± 1.1 61.1 ± 0.5 18.9 ± 0.6 50.3 ± 1.2
Fine-Tuned KG Enhanced LLM
UniKGQA [Jiang et al., 2022] 51.2 ± 1.0 75.1 ± 0.8 - -
RE-KBQA [Cao et al., 2023] 50.3 ± 1.2 74.6 ± 1.0 - -
ChatKBQA [Luo et al., 2024a] 76.5 ± 1.3 78.1 ± 1.1 85.8 ± 0.9 55.1 ± 0.6
RoG [Luo et al., 2024b] 64.5 ± 0.7 85.7 ± 1.4 73.3 ± 0.8 56.3 ± 1.0
Prompting KG Enhanced LLM with GPT-3.5
StructGPT [Jiang et al., 2023] 54.3 ± 1.0 72.6 ± 1.2 50.2 ± 0.5 51.3 ± 0.9
ToG [Sun et al., 2024] 57.1 ± 1.5 76.2 ± 0.8 53.6 ± 1.0 54.5 ± 0.7
PoG [Chen et al., 2024] 63.2 ± 1.0 82.0 ± 0.9 58.3 ± 0.6 57.8 ± 1.2
KnowPath (Ours) 67.9 ± 0.6 84.1 ± 1.3 61.5 ± 0.8 60.0 ± 1.0
Prompting KG Enhanced LLM with DeepSeek-V3
ToG [Sun et al., 2024] 60.9 ± 0.7 82.6 ± 1.0 59.7 ± 0.9 57.9 ± 0.8
PoG [Chen et al., 2024] 68.3 ± 1.1 85.3 ± 0.9 63.9 ± 0.5 61.2 ± 1.3
KnowPath (Ours) 73.5 ± 0.9 89.0 ± 0.8 65.3 ± 1.0 64.0 ± 0.7
Table 1: Hits@1 scores (%) of different models on four datasets under various knowledge-enhanced methods. We use GPT-3.5 Turbo and
DeepSeek-V3 as the primary backbones. Bold text indicates the results achieved by our method.
3 Experimental Setup
3.1 Baselines
We chose corresponding advanced baselines for comparison based on the three main paradigms of existing knowledge-based question answering. 1) The first is LLM-only, including the standard prompt (IO prompt [Brown et al., 2020]), the chain-of-thought prompt (CoT [Wei et al., 2022]), self-consistency (SC [Wang et al., 2022]), and RoG without planning (RoG w/o planning [Luo et al., 2024b]). 2) The second is the KG-enhanced fine-tuned LLMs, which include ChatKBQA [Luo et al., 2024a], RoG [Luo et al., 2024b], UniKGQA [Jiang et al., 2022], and RE-KBQA [Cao et al., 2023]. 3) The third is the KG-enhanced prompt-based LLMs, including Think-on-Graph (ToG [Sun et al., 2024]), Plan-on-Graph (PoG [Chen et al., 2024]), and StructGPT [Jiang et al., 2023]. Unlike the second, this scheme no longer requires fine-tuning and has become a widely researched mode today.

3.2 Datasets and Metrics
Datasets. We adopt four knowledge-based question answering datasets: the single-hop SimpleQuestions [Bordes et al., 2015], the complex multi-hop CWQ [Talmor and Berant, 2018] and WebQSP [Yih et al., 2016], and the open-domain WebQuestions [Berant et al., 2013].
Metrics. Following previous research [Chen et al., 2024], we apply exact match accuracy (Hits@1) for evaluation.

3.3 Experiment Details
Following previous research [Chen et al., 2024], to control the overall costs, the maximum subgraph exploration depth D_max is set to 3. Since Freebase [Bollacker et al., 2008] supports all the aforementioned datasets, we apply it as the base graph for subgraph exploration. We apply GPT-3.5-turbo-1106 and DeepSeek-V3 as the base models. All experiments are deployed on four NVIDIA A800-40G GPUs.

4 Results
4.1 Main Results
We conducted comprehensive experiments on four widely used knowledge-based question answering datasets. The experimental results are presented in Table 1, and four key findings are outlined as follows:

KnowPath performs the best. Our KnowPath outperforms all the prompting-driven KG-enhanced baselines. For instance, on the multi-hop CWQ, regardless of the base model used, KnowPath achieves a maximum improvement of about 13% in Hits@1. In addition, KnowPath outperforms LLM-only by a clear margin and surpasses the majority of fine-tuned KG-enhanced LLM methods. On the most challenging open-domain question answering dataset, WebQuestions, KnowPath achieves the best performance compared to strong baselines from other paradigms (e.g., PoG 61.2% vs ours 64.0%). This demonstrates KnowPath's ability to enhance the factuality of LLMs in open-domain question answering, which is an intriguing phenomenon worth further exploration.

KnowPath excels at complex multi-hop tasks. On both CWQ and WebQSP, KnowPath outperforms the latest strong baseline PoG, achieving average improvements of approximately 5% and 2.9%, respectively.
On WebQSP, DeepSeek-V3 with KnowPath not only outperforms all prompting-based KG-enhanced LLMs but also surpasses RoG, the strongest baseline among fine-tuned KG-enhanced LLMs (85.7% vs 89.0%). On the more challenging multi-hop CWQ, the improvement of KnowPath over PoG is significantly greater than its improvement on the simpler single-hop SimpleQuestions (5.2% vs 1.4%). These results collectively indicate that KnowPath is particularly strong at deep reasoning.

Method CWQ WebQSP SimpleQA WebQ
KnowPath 73.5 89.0 65.3 64.0
-w/o IPG 67.3 84.5 63.1 61.0
-w/o SE 64.7 83.1 60.4 60.7
Base 39.2 66.7 23.0 53.7

Table 2: Ablation experiment results on four knowledge-based question answering tasks. IPG stands for the Inference Paths Generation module, while SE stands for the Subgraph Exploration module.

Method LLM Call Total Token Input Token
ToG 22.6 9669.4 8182.9
PoG 16.3 8156.2 7803.0
KnowPath 9.9 2742.4 2368.9

Table 3: Cost-effectiveness analysis on the CWQ dataset between our KnowPath and the strong prompt-driven knowledge-enhanced benchmarks (ToG and PoG). Total Token includes two parts: the total number of tokens from the multiple input prompts and the total number of tokens from the intermediate results returned by the LLM. Input Token represents only the total number of tokens from the multiple input prompts. LLM Call refers to the total number of accesses to the LLM agent.
Knowledge enhancement greatly aids factual question answering. When question answering relies solely on LLMs, performance is poor across multiple tasks. For example, CoT achieves only about 20.5% Hits@1 on SimpleQuestions. This is caused by the hallucinations inherent in LLMs. Whichever method is applied to introduce KGs, it significantly outperforms the LLM-only baselines. The maximum improvements across the four tasks are 35.9%, 27.9%, 46.4%, and 15.3%. These results further emphasize the importance of introducing knowledge graphs for generating correct answers.
The stronger the base, the higher the performance. As DeepSeek-V3 is stronger than GPT-3.5, their performance on all tasks shows a significant difference after incorporating our KnowPath, even though both are prompting-based and knowledge-enhanced. Replacing GPT-3.5 with DeepSeek-V3, KnowPath achieves a maximum improvement from 67.9% to 73.5% on CWQ, and on SimpleQuestions it improves by at least 3.8%. These findings indicate that improvements in the base model directly drive improvements in knowledge-based question answering.

Figure 3: Comparison of KnowPath, its individual components, and strong baseline methods (ToG and PoG) across four commonly used knowledge-based question answering datasets.
KnowPath is a more flexible plugin. Compared to fine-tuned knowledge-enhanced LLMs, our KnowPath does not require fine-tuning of the LLM, yet it outperforms most of the fine-tuned methods. In addition, on the CWQ dataset, KnowPath with DeepSeek-V3 achieves performance very close to that of the strongest baseline, ChatKBQA, which requires fine-tuning for knowledge enhancement. On the WebQSP dataset, it outperforms ChatKBQA by about 11% (78.1% vs 89.0%). Overall, the resource consumption of KnowPath is significantly lower than that of fine-tuned KG-enhanced LLMs. This is because KnowPath improves performance by optimizing inference paths and enhancing knowledge integration, making it a more flexible, plug-and-play framework.

4.2 Ablation Study
We validate the effectiveness of each component of KnowPath and quantify its contribution to performance. The results are presented in Table 2 and visualized in Figure 3.

Each component contributes to the overall remarkable performance. After removing each module, performance declines on the different datasets. However, compared to the base model, the addition of these modules still significantly improves overall performance.

It is necessary to focus on the powerful internal knowledge of LLMs. Eliminating the Subgraph Exploration and relying solely on the internal knowledge mining of LLMs to generate reasoning paths and provide answers proves highly effective: it yields significant improvement across all four datasets, with an average performance enhancement of approximately 21.6%. The most notable improvement is on SimpleQA, where performance leaps from 23% to 60.4%. This indicates that even without incorporating external knowledge graphs, the factuality of the model's responses can be enhanced to a certain extent through internal knowledge mining. However, without the guidance of internal knowledge reasoning paths, KnowPath sees some performance decline across all tasks, especially on the complex multi-hop CWQ and WebQSP.

The most critical credible directed Subgraph Exploration is depth-sensitive. Removing the subgraph exploration leads to a significant decline in KnowPath across all tasks, av-
Figure 4: Visualization of the cost-effectiveness analysis on four public knowledge-based question-answering datasets.
Figure 6: Case study on the multi-hop CWQ (panel a) and the open-domain WebQuestions dataset (panel b). To provide a clear and vivid comparison with the strong baselines (ToG and PoG), we visualize the execution process of KnowPath.