GUNDAM - Aligning Large Language Models With Graph Understanding

A Preprint
Yong Liu
Renmin University of China
[email protected]
Abstract
Large Language Models (LLMs) have achieved impressive results in processing text data, which
has sparked interest in applying these models beyond textual data, such as graphs. In the field of
graph learning, there is a growing interest in harnessing LLMs to comprehend and manipulate graph-
structured data. Existing research predominantly focuses on graphs with rich textual features, such as
knowledge graphs or text attribute graphs, leveraging LLMs’ ability to process text but inadequately
addressing graph structure. This work specifically aims to assess and enhance LLMs’ abilities to
comprehend and utilize the structural knowledge inherent in graph data itself, rather than focusing
solely on graphs rich in textual content. To achieve this, we introduce the Graph Understanding
for Natural Language Driven Analytical Model (GUNDAM). This model adapts LLMs to better
understand and engage with the structure of graph data, enabling them to perform complex reasoning
tasks by leveraging the graph’s structure itself. Our experimental evaluations on graph reasoning
benchmarks not only substantiate that GUNDAM outperforms SOTA baselines, but also reveal
key factors affecting the graph reasoning capabilities of LLMs. Moreover, we provide
a theoretical analysis illustrating how reasoning paths can enhance LLMs’ reasoning capabilities.
1 Introduction
LLMs have achieved remarkable success in processing serialized natural language data Touvron et al. [2023], Achiam
et al. [2023], Anthropic [2024]. Recent explorations have pushed the boundaries of LLM applications beyond textual
data, particularly into the realm of computer vision Wang et al. [2024], Liu et al. [2024]. Parallel to these developments,
the utilization of LLMs in graph representation learning has emerged as a focal point of interest Wang et al. [2023], He
et al. [2024], Liu et al. [2023], Fatemi et al. [2024].
Considerable efforts have been devoted to deploying LLMs for processing graph data Huang et al. [2024], Chen et al.
[2024a]. For instance, some works Liu et al. [2023], He et al. [2024] have utilized LLMs to handle rich-text graphs
where node text attributes are used for node classification. These works essentially translate the node classification
task into a text classification task. While yielding impressive results, they primarily leverage the capability of LLMs
to process textual features, consequently diminishing the emphasis on intrinsic graph data features, particularly the
structural aspects of the graph. In light of this, we aim to explore a more fundamental and universal question: Can
LLMs genuinely understand graph data, especially the structural knowledge, and rely on it to perform complex
reasoning tasks?
Investigating whether LLMs can comprehend graph data is crucial, as graphs are fundamental data structures that
represent entities and their complex interrelations, effectively modeling various real-world scenarios Wu et al. [2020],
Li et al. [2021]. Enhancing LLMs’ ability to understand graph data has the potential to significantly advance their
general intelligence Bubeck et al. [2023]. Many tasks, such as recommendation systems Fan et al. [2019] and
knowledge graphs Yuan et al. [2022], explicitly or implicitly employ graph structures for reasoning processes. Recent
advances Wang et al. [2023], Chen et al. [2024b], Luo et al. [2024] have emphasized the exploration of LLMs’ abilities
in understanding and reasoning over graph structures. Specifically, benchmarks such as NLGraph Wang et al. [2023] and
GraphQA Fatemi et al. [2024] have been developed to evaluate the performance of LLMs on various graph reasoning
tasks. Studies utilizing closed-source models on these benchmarks indicate that while LLMs demonstrate fundamental
competencies in addressing simple reasoning tasks, they struggle with more complex graph reasoning challenges, as
observed by Wang et al. [2023] and Fatemi et al. [2024].
In parallel, the reasoning capabilities of open-source LLMs on graphs have garnered significant attention, with research
primarily bifurcating into two categories. The first category focuses on leveraging prompt techniques to enhance the
inherent reasoning abilities of LLMs for graph reasoning tasks. For instance, Wang et al. [2023] and Fatemi et al.
[2024] explored advanced prompt strategies such as Chain of Thought (CoT) and Self-consistency (SC) to augment
LLMs’ reasoning efficacy. Additionally, Perozzi et al. [2024a] introduced soft prompt techniques that map graph
structures onto an LLM’s embedding space for reasoning. These prompt-based approaches, though cost-effective due
to negligible retraining or minimal parameter tuning, generally yield limited improvements in reasoning capabilities.
The second research avenue involves fine-tuning LLMs with synthetically generated graph reasoning data, which
tends to significantly enhance their reasoning performance. However, this method has its drawbacks. First, the quality
of synthetic data is unpredictable; manually constructing graph reasoning data requires considerable expertise in
graph theory, making it a costly endeavor. Observations by Wang et al. [2023] and Fatemi et al. [2024] suggest that
even powerful closed-source LLMs possess limited graph reasoning capabilities and thus cannot reliably generate
high-quality graph reasoning data. Second, the specifics of what constitutes effective graph reasoning data and how to
best utilize this data for maximizing an LLM’s capabilities in graph reasoning remain under exploration.
To this end, we introduce the Graph Understanding for Natural Language Driven Analytical Model (GUNDAM),
which is designed to effectively comprehend graph structures and execute complex reasoning tasks over graphs.
GUNDAM integrates several pivotal components: Firstly, to encode graph structures into the LLM, we employ the
Graph Projection method. This approach effectively serializes the graph structure while preserving crucial information
that is comprehensible to the LLM. Secondly, we developed a pipeline to construct high-quality graph reasoning data,
which includes CoT reasoning paths. This pipeline leverages graph algorithms not only to ensure accuracy but also
to provide a detailed intermediate reasoning process. By systematically applying these algorithms, we can derive
structured reasoning paths that enhance the depth and clarity of the model’s reasoning ability. Finally, we introduce
an Alignment Tuning method, which fine-tunes GUNDAM using the graph reasoning data formulated. This tuning
process aligns the model’s reasoning capabilities with the structured graph reasoning data, further refining its ability to
process and reason about graph-based information effectively. The integration of these strategies enables GUNDAM to
harness and expand upon the potential of LLMs in understanding and reasoning about complex graph structures. The
contributions of this paper are summarized as follows:
• Data Construction: We have developed a pipeline for constructing high-quality graph reasoning data that
significantly enhances the reasoning capabilities instilled by training and improves model interpretability.
• Methodology: We introduce GUNDAM, specifically designed to enable LLMs to comprehend graph-structured
data and perform complex reasoning on graphs.
• Understanding LLM Capabilities in Graph Reasoning: Our experiments validate that GUNDAM achieves
SOTA performance while also identifying factors that influence LLMs’ graph reasoning capabilities. Addition-
ally, we provide a theoretical analysis of how reasoning paths enhance LLMs’ reasoning capabilities.
2 Related Work
2.1 LLM for Graph
In the realm of graph learning, substantial progress has been made by integrating Large Language Models (LLMs) with
Text-Attributed Graphs (TAGs), where text descriptions are present either within the nodes or on the entire graph Liu
et al. [2023], Ye et al. [2023], Chen et al. [2024a], He et al. [2024], Chen et al. [2023]. In this context, several notable
studies have emerged, focusing on diverse tasks such as node classification He et al. [2024], Liu et al. [2023], Chen
et al. [2023, 2024a], link prediction Bi et al. [2024], and graph classification Zhao et al. [2023], Qian et al. [2023].
One pivotal area of application is node classification within citation networks, where TAPE He et al. [2024] stands
out as a pioneering work. TAPE processes the titles and abstracts of research papers through an LLM to generate
predictions and explanations. These outputs are subsequently utilized as augmented features for training a Graph Neural
Network (GNN), allowing for improved classification performance. Addressing the challenge of sparse labeled data
in graph datasets, Chen et al. [2023] proposed an innovative pipeline that leverages LLMs to generate high-quality
annotated data. This inventive approach reduces the annotation burden while enhancing the utility of available graph
data. For handling graph data directly, LLaGA Chen et al. [2024a] proposes a novel approach by employing a soft
prompt technique. It retains the pre-trained parameters of the LLM and introduces a Projector, which is trained to map
node sequences into the token embedding space. This adjustment ensures that graph data are accommodated by the
LLM for effective prediction tasks. For link prediction, LPNL Bi et al. [2024] introduces a scalable method leveraging
LLMs. This method utilizes a two-stage sampling process for the source node and potential neighbor nodes to identify
anchor nodes. Prompts generated based on these anchor nodes are then fed into the LLM for accurate prediction. In
terms of classification, Qian et al. [2023] explore the use of LLMs for predicting molecular properties. This signifies
an extension of LLM applicability to a broader range of graph types and data characteristics. OFA Liu et al. [2023]
represents a comprehensive approach that designs a versatile method applicable to various graph-related tasks. This
method entails an intricate pipeline combining LLMs with GNNs. It makes use of subgraphs centered around nodes of
interest (NOIs) and creates prompt graphs, which are then integrated into the original graph, thereby enriching the input
for the GNN in an informed manner.
The application of LLMs to TAGs showcases both notable strengths and significant limitations. On one hand, LLMs
exhibit significant potential in addressing complex textual data, effectively capturing intricate textual patterns and
relationships, which are pivotal for tasks such as node classification. Their advanced capabilities in interpreting nuanced
text significantly enhance classification accuracy. On the other hand, the integration of these models with graph data
presents challenges. The conversion from graph structure to text sequences may result in verbose inputs that can be
computationally intensive, thus hindering scalability. Moreover, there is a concern that the structural properties of
graphs, such as topology and connectivity, are underutilized. This underutilization potentially limits the effectiveness of
LLMs in fully leveraging the rich structural information inherent in graphs, which is crucial for a more comprehensive
understanding and modeling of graph data.
2.2 Graph Reasoning

Graph reasoning, an evolving field within graph representation learning and natural language processing, focuses on
leveraging graph structures to perform cognitive tasks such as shortest path finding and topological sorting. Significant
efforts Perozzi et al. [2024b], Chen et al. [2024b], Fatemi et al. [2024], Luo et al. [2024], Chai et al. [2023] have been
directed toward assessing and enhancing the reasoning capabilities of models over graph-structured data. The creation
of benchmarks for graph reasoning has played a pivotal role in this research direction. For instance, NLGraph Wang
et al. [2023] and GraphQA Fatemi et al. [2024] are notable projects that have independently established benchmarks
tailored for evaluating reasoning tasks on graphs. These benchmarks primarily utilize closed-source models like GPT to
explore the potential and limitations of LLMs in handling graph reasoning tasks. The initial explorations conducted
by these benchmarks revealed that LLMs tend to underperform in fundamental graph reasoning tasks, underscoring a
critical space for further development in model capabilities and training methods.
In an innovative approach to broaden the spectrum of graph reasoning challenges, VisionGraph Li et al. [2024]
introduced a multimodal graph reasoning benchmark. This benchmark uniquely represents each graph as an image,
thereby shifting the challenge towards visual Question Answering (QA) tasks. This approach not only diversifies the
types of inputs that models must handle but also tests the adaptability of reasoning models in interpreting and processing
information across differing data modalities. Luo et al. [2024] introduced GraphInstruct, a benchmark specifically
designed for instruction fine-tuning on graph reasoning tasks. This development paved the way for further innovations
in the field, exemplified by GraphWiz Chen et al. [2024b]. GraphWiz represents the first open-source LLM dedicated to
solving a variety of graph problems through explicit reasoning. In another notable development, Perozzi et al. [2024a]
proposed a method employing soft prompt technology that utilizes GNNs to map graph structures into the embedding
space of LLMs. This technique allows LLMs to process graph data directly, thereby expanding their applicability
in performing graph reasoning tasks. Additionally, significant work has been done in applying LLMs to TAGs for
reasoning tasks. For instance, Sun et al. [2023] and Luo et al. [2023] have focused on extracting genuine and effective
relational paths within knowledge graphs. Their approaches aim to assist LLMs in generating faithful and interpretable
reasoning outputs. By identifying and sourcing accurate relational paths, these models can generate more credible and
logically consistent results in complex reasoning scenarios.
3 Methodology
3.1 Preliminaries
Graph A graph is denoted as G(V, E), where V and E represent the set of nodes and edges respectively. (u, v, w) ∈ E
denotes an edge, where u, v ∈ V are the two nodes of the edge and w denotes the weight of the edge.
[Figure 1: Overview of GUNDAM, depicting the training and inference pipelines: the graph is projected into a text sequence, hard reasoning paths are constructed and rewritten, and the tuned model produces intermediate steps and answers.]
Graph Reasoning Graph reasoning is a fundamental component in the field of graph representation learning, where
it focuses on inferring implicit relationships and attributes from nodes, edges, or subgraphs within structured graph
data. This area involves critical operations such as information propagation, feature extraction, and pattern recognition,
which are essential for supporting high-level reasoning and decision-making tasks across various applications including
social network analysis Hung et al. [2020], knowledge graphs Sun et al. [2023], Luo et al. [2023], etc.
In this paper, we specifically concentrate on graph reasoning that explicitly utilizes the structural aspects of graphs
without relying on external textual data. Our focus is directed towards tasks such as identifying connectivity, performing
topological sorting, finding shortest paths, and determining maximum flow within graphs. These tasks reveal the
spatial and relational dynamics intrinsic to graph structures, thereby providing a deeper understanding of graph theory
applications in practical scenarios.
3.2 The GUNDAM Framework

In this section, we introduce GUNDAM, illustrated in Figure 1, which outlines our strategy to empower an LLM to
effectively process graph data inputs and undertake complex reasoning based on graph structures. To realize this goal,
we need to address three pivotal challenges: 1) How can we encode graph structures for LLM input? We address this by
adopting the Graph Projection method, which transforms graph structures into textual sequences that retain essential
graph information in a format processable by the LLM. 2) How do we construct correct and diverse alignment data? We
ensure data accuracy through the use of graph algorithms, while diversity is augmented by exploiting the generative
capabilities of the LLM. 3) How can we enhance the graph reasoning capabilities of an LLM? This is achieved via
Alignment Tuning, where we fine-tune the LLM specifically to enhance its performance on graph-based reasoning tasks.
Additionally, we theoretically investigate the benefits brought by the CoT reasoning path.
Graph Projection LLMs are not inherently equipped to process graph data directly. To facilitate LLMs in compre-
hending graph data and executing reasoning tasks, it is essential to first transform the graph data into a format that can
be understood by LLMs. Extensive research Perozzi et al. [2024a], Fatemi et al. [2024] has focused on developing
encoding strategies to render graph data compatible with LLMs. These strategies can primarily be classified into
two branches: Graph2Text and Graph2Vec. The Graph2Text method translates graph data into textual sequences that
preserve the structural and relational integrity in a sequential format suitable for direct ingestion by LLMs. Conversely,
the Graph2Vec approach transforms graphs into vector representations, which are subsequently mapped into the
embedding space of LLMs.
In this paper, we choose to utilize the Graph2Text method to convert graph data into a format suitable for processing
by LLMs. Specifically, we describe the structure of the graph using triples of the form (u, v, w), as elaborated in
Section 3.1. To illustrate, consider the graph from Figure 1, which can be represented as a sequence of such triples
in Example: Graph Projection:
This is an undirected graph, where (u, v, w) denotes that node u and node v are connected by an undirected edge
with the weight w.
(0, 2, 0.8) (0, 3, 0.4) (1, 2, 0.5) (2, 3, 0.3)
This structured textual representation preserves and conveys the graph’s relational information in a form that is
comprehensible to LLMs, thereby facilitating efficient graph-based reasoning tasks. Alternative graph encodings,
such as an adjacency matrix, are also feasible and convenient.
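To make the projection concrete, here is a minimal Graph2Text sketch in the spirit of the description above; the function name and the exact preamble wording are our own illustration, not the authors' released code.

```python
def project_graph(edges):
    """Serialize an undirected weighted graph, given as (u, v, w) triples,
    into a textual sequence (Graph2Text) that an LLM can consume."""
    preamble = (
        "This is an undirected graph, where (u, v, w) denotes that node u "
        "and node v are connected by an undirected edge with the weight w.\n"
    )
    return preamble + " ".join(f"({u}, {v}, {w})" for u, v, w in edges)

# Running example from Figure 1:
print(project_graph([(0, 2, 0.8), (0, 3, 0.4), (1, 2, 0.5), (2, 3, 0.3)]))
```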
Graph Reasoning Paths via Graph Algorithms Accurate and diverse alignment data is crucial for graph reasoning
tasks; however, collecting high-quality alignment data is non-trivial. Conventional methods, such as human or AI-
powered annotation, exhibit distinct drawbacks. Manual annotation necessitates fundamental expertise in graph theory,
leading to prohibitively high costs. Conversely, directly adopting AI tools for annotation fails to ensure accuracy.
We analyze this phenomenon in our experiments in Section 4, where even a powerful model like GPT-4 exhibits
limited graph reasoning capability. To address this issue, we have devised a straightforward yet effective method to
construct graph reasoning data. We utilize established graph algorithms to solve graph reasoning problems, meticulously
recording both the solution processes and the answers. For example, we employ the Breadth-First Search (BFS)
algorithm for connectivity tasks, Depth-First Search (DFS) for cycle detection tasks, and Dijkstra’s algorithm for
shortest path tasks. By doing so, we guarantee the correctness of the answers. We refer to these directly obtained correct
answers as Plain Answers (PA).
Example: PA
Recent work Merrill and Sabharwal [2023a] has shown that transformer decoders, without any intermediate steps, can
only solve problems that lie within a fairly small circuit complexity class Merrill and Sabharwal [2023b]. Therefore,
constructing answers with intermediary steps is crucial. Existing studies Hsieh et al. [2023], Chen et al. [2024b]
demonstrate that integrating CoT reasoning pathways can significantly enhance the reasoning capabilities of LLMs.
One straightforward approach is to use a powerful LLM to generate reasoning paths when given a question and its
corresponding Plain Answer (PA). However, as we will verify in Section 4, even when provided with the correct answer,
GPT-4 struggles to generate accurate reasoning paths. Fortunately, when utilizing graph algorithms to solve graph
reasoning problems, we concurrently record both the answers and their solution processes. This allows us to leverage
these solution processes to construct manually designed reasoning paths, which we denote as Answer with Hard Path
(AHP).
Example: AHP
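As a concrete illustration of how such hard paths can be derived from algorithm traces, the sketch below solves a Connectivity query with BFS while logging every expansion step; the verbalization template and function names are our own hedged rendering of the idea, not the paper's exact AHP format.

```python
from collections import deque

def connectivity_with_path(adj, source, target):
    """Solve a Connectivity query with BFS, returning both the Plain Answer
    (PA) and an Answer with Hard Path (AHP) built from the recorded steps."""
    parent = {source: None}
    queue = deque([source])
    steps = []
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v in parent:
                continue
            parent[v] = u
            steps.append(f"node {u} is connected to node {v}")
            if v == target:
                # Reconstruct the witness path from the parent pointers.
                path, node = [], v
                while node is not None:
                    path.append(node)
                    node = parent[node]
                route = "-".join(map(str, reversed(path)))
                pa = f"Yes, there is a path between node {source} and node {target}."
                ahp = ", ".join(steps) + f", so we can follow the path {route}. {pa}"
                return pa, ahp
            queue.append(v)
    pa = f"No, there is no path between node {source} and node {target}."
    ahp = (", ".join(steps) + ". " + pa) if steps else pa
    return pa, ahp

# Example on the (unweighted view of the) graph from Figure 1:
adj = {0: [2, 3], 1: [2], 2: [0, 1, 3], 3: [0, 2]}
print(connectivity_with_path(adj, 1, 3)[1])
```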
However, the manual construction of rule-based AHPs tends to result in uniform answer formats, potentially causing
the model to overfit to this specific format Zhang et al. [2024]. This lack of diversity in training can lead to a decrease
in the model’s generalization performance. To address this issue, we employ LLMs to rewrite the AHP-generated
answers, thereby enhancing their diversity. We refer to the answers obtained through this method as Answers with
Soft Path (ASP). It is noteworthy that when provided with AHP, the task facing the LLM shifts from graph reasoning
to sentence rewriting. Sentence rewriting is inherently well-suited to LLMs due to their proficiency in language
manipulation.
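Since ASP construction is just sentence rewriting, it can be sketched in a few lines; the template below is Template 2 from Appendix B, while `call_llm` is a hypothetical stand-in for whatever chat-completion API is used.

```python
REWRITE_TEMPLATE = (
    "Rewrite the following sentence to improve its logical flow:\n{answer}"
)

def ahp_to_asp(ahp_answer, call_llm):
    """Turn a rule-based Answer with Hard Path (AHP) into an Answer with
    Soft Path (ASP) by asking an LLM to rewrite it. The LLM only performs
    sentence rewriting here, not graph reasoning, so the correctness of the
    underlying reasoning path is preserved up to the rewrite's faithfulness."""
    return call_llm(REWRITE_TEMPLATE.format(answer=ahp_answer))
```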
Example: ASP
This approach not only ensures the correctness of the reasoning process but also provides a structured method to
enhance LLMs’ graph reasoning capability through accurate and logically sound CoT data.
Alignment Tuning We fine-tune LLMs using datasets that include graph reasoning paths to align them better with graph understanding and enable reasoning based on graph structures. Our training dataset, $\mathcal{D} = \{(G_i, T_i, Q_i, R_i, A_i)\}_{i=1}^{N}$, consists of $N$ quintuples where each element represents a graph $G_i$, a task description $T_i$, a query $Q_i$, a reasoning path $R_i$, and an answer $A_i$, respectively.
Initially, we employ the Graph Projection method to transform each graph $G_i$ into a textual sequence, denoted as $S_{G_i} = f_P(G_i)$, where $f_P$ is the Graph Projection function. The reasoning path $R_i$ is obtained through graph algorithms, expressed as $R_i = f_R(G_i, T_i, Q_i)$, where $f_R$ is the function that uses graph algorithms to solve the graph reasoning task and yields the reasoning path. The training objective is to maximize the probability of generating the correct answers based on this structured input:

$$\max_{G_\theta} \; p(A_i \mid S_{G_i}, T_i, Q_i, R_i), \qquad (1)$$

where $G_\theta$ denotes the parameters of GUNDAM. To enable GUNDAM to generate intermediate reasoning processes that aid in predicting the final answer, we formulate this as $(\hat{R}_i, \hat{A}_i) = f_G(S_{G_i}, T_i, Q_i)$, where $f_G$ is the inference function of GUNDAM. The training objective function is defined as

$$\mathcal{L} = \mathcal{L}_A + \lambda \mathcal{L}_R, \qquad (2)$$

where $\mathcal{L}_A = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{A}_i, A_i)$ is the answer prediction loss, $\mathcal{L}_R = \frac{1}{N}\sum_{i=1}^{N} \ell(\hat{R}_i, R_i)$ is the reasoning path generation loss, $\ell$ is the cross-entropy loss between the predicted and target tokens, and $\lambda$ is a hyperparameter.
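In implementation terms, Eq. (2) reduces to a token-level cross-entropy in which answer tokens and reasoning-path tokens are weighted separately. The PyTorch sketch below is our own schematic of this objective; the tensor names and the answer/reasoning segment mask are illustrative assumptions, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def alignment_loss(logits, labels, is_answer_token, lam=1.0):
    """Schematic version of L = L_A + lambda * L_R from Eq. (2).

    logits:          (batch, seq_len, vocab) next-token predictions
    labels:          (batch, seq_len) target token ids, -100 on prompt tokens
    is_answer_token: (batch, seq_len) bool, True on answer tokens A_i,
                     False on reasoning-path tokens R_i
    """
    per_token = F.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=-100, reduction="none"
    )  # (batch, seq_len), zero where labels == -100
    valid = labels != -100
    ans = is_answer_token & valid
    rea = (~is_answer_token) & valid
    loss_a = per_token[ans].mean() if ans.any() else logits.new_zeros(())
    loss_r = per_token[rea].mean() if rea.any() else logits.new_zeros(())
    return loss_a + lam * loss_r
```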
3.3 Theoretical Analysis

In this section, we provide a theoretical analysis of how CoT reasoning paths can enhance the reasoning capabilities of LLMs, thereby facilitating the generation of correct answers with greater ease. Due to the complexities of graph
reasoning tasks, LLMs without robust reasoning abilities often fail to generate intermediate responses Z for the ultimate
correct answer a. As analyzed above, the utilization of an explicit CoT reasoning path R can boost the LLM’s reasoning
ability, thus enabling more accurate outcomes.
Theorem 1. Given the following conditions:

1. Non-triviality: The reasoning path R provides non-trivial information about the responses Z, such that
H(R|Z) > 0.

2. Relevance: The reasoning path R contains information relevant to the correct answer a that is not fully
captured by the response Z, such that I(a; R|Z) > 0.

Then H(a|Z, R) < H(a|Z), i.e., the reasoning path strictly reduces the uncertainty about the correct answer.
In Theorem 1, H(·|·) represents the conditional entropy and I(·; ·) denotes the mutual information. These conditions
coincide with observations from previous studies Merrill and Sabharwal [2023a], Chiang et al. [2023a] that point to
LLMs’ shortcomings in handling sequential reasoning challenges, such as simulating finite state machines, determining
connectivity in graphs, or solving matrix equations. The reasoning typically requires a series of logical steps and
transformations that a simple direct model output Z might not fully capture. We provide a proof of Theorem 1 in
Appendix A. In practice, the reasoning path R encapsulates progressive, structured reasoning or derivation steps leading
to Z. Therefore, knowing R reduces the uncertainty about a more effectively than knowing just Z. This theoretical
analysis hinges on the nature of R providing supplementary, clarifying information beyond what Z alone offers, aligning
with principles of information theory MacKay [2003] where additional context reduces entropy.
4 Experiment
In this section, we conduct experiments to address two critical research questions (RQ):
• RQ1: How does GUNDAM perform in graph reasoning tasks compared to current SOTA open-source and
closed-source LLMs?
• RQ2: What factors significantly impact the graph reasoning capabilities of LLMs?
Dataset We conducted experimental validation on the NLGraph benchmark Wang et al. [2023], a comprehensive
graph reasoning benchmark designed to evaluate performance across a spectrum of graph reasoning tasks. This
benchmark encompasses eight distinct levels and complexities of tasks, namely: Connectivity, Cycle, Topological
Sort, Shortest Path, Maximum Flow, Bipartite Graph Matching, Hamilton Path, and Graph Neural Networks. Detailed
statistical information about the dataset and elaborate descriptions of the eight graph reasoning tasks are provided in
Appendix C.
Baselines In our experimentation, we opted for a comparative analysis using both closed-source and open-source
models as baselines. Specifically, we selected two closed-source models, GPT-4 Achiam et al. [2023] (version gpt-4-
0125-preview) and GPT-3.5 (version gpt-3.5-turbo-1106), alongside two open-source models, Vicuna-7B Chiang et al.
[2023b] (version vicuna-7b-v1.5) and Llama3-8B Dubey et al. [2024] (version Llama 3.1 8B Instruct). Furthermore,
we utilized Vicuna-7B and Llama3-8B as the base models for training two additional models, respectively denoted as
GUNDAM-V and GUNDAM-L in our study.
Settings We follow the settings of NLGraph Wang et al. [2023] for the dataset split, prompt, and the evaluation of the
experimental results. Specifically, for all the eight tasks, we use accuracy as the evaluation metric. All experiments are
conducted on an 8×A800 machine. The learning rate is set to 2e-5 and the hyperparameter λ is set to 1. More detailed
experimental settings are available in Appendix D.
4.1 Main Results

We conducted experimental validations on eight graph reasoning tasks in a zero-shot setting. The accuracy for each
difficulty level, as well as the average accuracy (Avg.), is provided in Table 1. Both open-source models, Vicuna-7B
and Llama3-8B, exhibited poor performance across all tasks. Despite its advanced generative capabilities, the newly
released Llama3-8B showed slightly better results than Vicuna-7B; however, both models demonstrated limited graph
reasoning capabilities. The closed-source models, GPT-3.5 and GPT-4, displayed fundamental graph reasoning abilities.
On simpler tasks such as Connectivity, their performance markedly surpassed that of the open-source models. However,
when faced with complex graph reasoning tasks such as Maximum Flow and GNN, GPT-3.5 fell short. Although
GPT-4 showed somewhat improved results over GPT-3.5, its performance was still suboptimal. GUNDAM-V and
GUNDAM-L demonstrated commendable performance across all eight tasks, surpassing the advanced generative model
GPT-4 with parameter sizes of 7B and 8B, respectively. They exceeded their corresponding base models by 38.36% and
49.53%, validating the effectiveness of our proposed reasoning path construction and Alignment Tuning in enhancing
LLMs’ capabilities in graph reasoning. This indicates that LLMs are inherently capable of graph reasoning but require
appropriate alignment with graph reasoning tasks to exhibit this capacity. Notably, GUNDAM-L achieved exceptionally
high accuracy in the GNN task, suggesting that LLMs can effectively simulate two-layer graph convolution operations
on relatively small-scale graphs, which involve complex multi-step reasoning.
[Figure 2: Accuracy of GPT-3.5, GPT-4, Vicuna-7B, and GUNDAM under different prompt techniques (Zero-Shot, 0-CoT, CoT, Few-Shot).]
4.2 In-Depth Analysis

In this section, we investigate the impact of reasoning paths, prompt techniques, and training data difficulty on LLMs’ reasoning abilities. Unless specified otherwise, all subsequent experiments use Vicuna-7B as the base model, i.e.,
GUNDAM-V.
Reasoning Path In this subsection, we further investigate how reasoning paths influence the graph reasoning
performance of LLMs. We conducted experiments on the Connectivity, Cycle, and Shortest Path tasks, with the average
accuracy presented in Table 2. More detailed results are provided in Appendix E.1. We use AGP to denote Answers
with Generated Path, where reasoning paths are generated by GPT-4. Additionally, following NLGraph Wang et al.
[2023], we include a Random baseline, which randomly assigns “Yes” or “No” to the Connectivity and Cycle tasks,
with an expected accuracy of 50%. For the Shortest Path task, we randomly pick a valid path and report the sum of the
weights along the path as the answer.
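For concreteness, a minimal sketch of this Random baseline, our own illustration of the description above, could look as follows (graph format and function names are assumptions):

```python
import random

def random_yes_no():
    """Random baseline for Connectivity and Cycle: expected accuracy 50%."""
    return random.choice(["Yes", "No"])

def random_valid_path(adj, u, target, visited=None):
    """Randomized DFS returning some simple u-target path (valid, but not
    necessarily the shortest), or None if unreachable. adj[u] = [(v, w), ...]."""
    visited = (visited or set()) | {u}
    if u == target:
        return [u]
    neighbours = [v for v, _ in adj.get(u, []) if v not in visited]
    random.shuffle(neighbours)
    for v in neighbours:
        rest = random_valid_path(adj, v, target, visited)
        if rest is not None:
            return [u] + rest
    return None

def path_weight(adj, path):
    """Sum of edge weights along the sampled path, reported as the answer."""
    weight = {(u, v): w for u in adj for v, w in adj[u]}
    return sum(weight[(a, b)] for a, b in zip(path, path[1:]))
```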
The results from models trained on PA, AGP, Answers with Hard Path (AHP), and ASP surpassed random outcomes,
indicating that aligning LLMs with data containing correct answers generally enhances their reasoning capabilities
to various extents. PA showed moderate performance across the three tasks, suggesting that data with only answers
and no intermediate processes provide limited improvement in LLMs’ reasoning abilities. AGP performed well in the
Connectivity task but was less effective in the Cycle and Shortest Path tasks. This is likely because AGP’s reasoning
paths, generated by GPT-4, are reliable in simpler tasks such as Connectivity, where GPT-4 can generate correct paths.
However, GPT-4 struggles with more complex reasoning tasks, leading to incorrect reasoning paths and, consequently,
incorrect answers in the Cycle and Shortest Path tasks. Both AHP and ASP consistently demonstrated superior
performance across all tasks, reaffirming that integrating reasoning paths can indeed strengthen LLMs’ reasoning
abilities, provided these paths are accurate. This also confirms that our methodology for constructing reasoning paths is
effective. Furthermore, while AHP and ASP achieved comparable performance, models aligned with ASP data could
generate more diverse responses, showcasing the added benefit of ASP in enhancing response variability.
Prompt We further explored the impact of prompt techniques on the graph reasoning performance of LLMs. We
conducted experiments using five distinct prompt techniques, with results depicted in Figure 2. Typically, zero-shot
configurations yielded relatively good or even the best results. However, advanced prompting techniques such as
CoT and SC resulted in negligible performance improvements. These findings suggest that merely employing prompt
techniques struggles to significantly enhance LLMs’ understanding and reasoning capabilities on graphs. In contrast, our
proposed Alignment Tuning offers a viable solution to effectively bolster the graph reasoning abilities of LLMs. This
insight highlights the limitations of conventional prompting strategies and underscores the importance of tailor-made
adaptations such as Alignment Tuning to fully leverage LLMs in complex reasoning tasks involving graph-structured
data.
[Figure 3: Transfer across data difficulties on the Connectivity, Cycle, and Shortest Path tasks. (a) Easy to Hard. (b) Hard to Easy.]
Data Difficulty To investigate the impact of data difficulty on model performance, we conducted transfer experiments
using datasets of varying difficulties. Specifically, we implemented two scenarios: 1) easy to hard, where we trained
models with easy-level data and tested them on datasets of increasing difficulty to evaluate the transferability of
reasoning capabilities; and 2) hard to easy, where we trained models on hard-level data and tested them on varying
difficulties. The results are presented in Figure 3.
For simpler reasoning tasks such as Connectivity, models trained on easy data demonstrated effective transferability
to other difficulty levels. However, for more complex reasoning tasks, the effectiveness of transfer from easy data
was limited. Conversely, models trained on hard data showed significant transferability across all tested difficulty
levels, suggesting that using hard data in training substantially enhances the LLM’s graph reasoning capabilities. This
effectiveness may be attributed to the intricate reasoning processes invoked by hard data, which likely induces the
model to generate more intermediate steps, thereby enhancing reasoning depth and robustness. This observation aligns
with findings in Merrill and Sabharwal [2023a], which analyzed how the length of a transformer’s chain of thought
impacts its reasoning power. Moreover, while models trained solely on hard data performed better than those trained on
easy data, they underperformed compared to models trained on a mixed-difficulty dataset. This indicates that, although
hard data contributes significantly to enhancing reasoning capabilities, a hybrid training approach incorporating datasets
of various difficulties yields the best results. This mixture presumably builds a more robust and versatile reasoning
model, capable of handling both simple and complex graph reasoning tasks effectively.
[Figure 4: Case studies on two complex tasks. Left: the Maximum Flow query "What is the maximum flow from node 5 to node 8?", with the step-by-step outputs of GPT-3.5, GPT-4, and GUNDAM-L. Right: the GNN query "What's the embedding of each node after two layers of simple graph convolution layer?", with each model's layer-by-layer embedding updates. Due to the complexity of the GNN task, which requires generating extensive intermediate steps, only the key segments of each model's output are excerpted.]

Case Study In Figure 4, we present case studies for two complex tasks: Maximum Flow and GNN. Results from the two base
models are omitted for these tasks due to their inability to handle such complexity. For the Maximum Flow task,
GPT-3.5 identified an algorithm capable of solving the problem but failed to detail the necessary intermediate processes,
resulting in an incorrect answer. GPT-4, on the other hand, offered more detailed intermediate results but erroneously
reversed the direction of the edge from node 8 to node 6, leading to incorrect intermediate steps and consequently a
wrong conclusion. Conversely, GUNDAM-L not only provided correct intermediate steps but also accurately reasoned
the correct answer.
Regarding the GNN task, which requires modeling two layers of graph convolution operations involving complex
multi-step reasoning, GPT-3.5 made errors in simulating even the first layer of convolution. It incorrectly updated
the embedding of node 3 using the post-update embedding of node 2 as if they were the original embedding. GPT-4
accurately modeled the first layer but replicated the same error as GPT-3.5 in the second layer, wrongly using the
updated embedding of node 0 for updating nodes 2, 3, and 4 in that layer, leading to incorrect final embeddings for these
nodes. GUNDAM-L, however, executed correct feature updates at each layer, resulting in the correct outcomes. From
these case studies, it is apparent that GPT-4 exhibits stronger reasoning capabilities than GPT-3.5 but still encounters
challenges with tasks requiring complex and multi-step reasoning, occasionally leading to erroneous reasoning outcomes.
In contrast, GUNDAM-L demonstrated robust graph reasoning proficiency, effectively understanding and manipulating
complex graph structures through multi-step reasoning. Furthermore, GUNDAM-L not only produces correct answers
but also generates a reasoning path that elucidates how those answers were derived, enhancing the interpretability of the
model’s outputs.
5 Conclusion
To enhance the ability of LLMs to understand graph data and perform reasoning tasks based on graph structures, we
introduce GUNDAM. It employs a Graph Projection method to convert graph structures into textual formats that LLMs
can process and constructs CoT reasoning data via graph algorithms. Furthermore, we propose Alignment Tuning to
effectively align LLMs with graph reasoning tasks. Experiments conducted on eight graph reasoning tasks demonstrate
the efficacy of GUNDAM, validating its utility in enabling sophisticated graph-based reasoning capabilities in LLMs.
References
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste
Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.
arXiv preprint arXiv:2302.13971, 2023.
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida,
Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,
2023.
AI Anthropic. The claude 3 model family: Opus, sonnet, haiku. Claude-3 Model Card, 2024.
Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao,
et al. Visionllm: Large language model is also an open-ended decoder for vision-centric tasks. Advances in Neural
Information Processing Systems, 36, 2024.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information
Processing Systems, 36, 2024.
Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, and Yulia Tsvetkov. Can language models
solve graph problems in natural language? In Thirty-seventh Conference on Neural Information Processing Systems,
2023. URL https://openreview.net/forum?id=UDqHhbqYJV.
Xiaoxin He, Xavier Bresson, Thomas Laurent, Adam Perold, Yann LeCun, and Bryan Hooi. Harnessing explanations:
Llm-to-lm interpreter for enhanced text-attributed graph representation learning. In The Twelfth International
Conference on Learning Representations, 2024.
Hao Liu, Jiarui Feng, Lecheng Kong, Ningyue Liang, Dacheng Tao, Yixin Chen, and Muhan Zhang. One for all:
Towards training one graph model for all classification tasks. Arxiv Preprint Arxiv:2310.00149, 2023.
Bahare Fatemi, Jonathan Halcrow, and Bryan Perozzi. Talk like a graph: Encoding graphs for large language models.
In The Twelfth International Conference on Learning Representations, 2024.
Xuanwen Huang, Kaiqiao Han, Yang Yang, Dezheng Bao, Quanjin Tao, Ziwei Chai, and Qi Zhu. Can gnn be good
adapter for llms? In Proceedings of the ACM on Web Conference 2024, pages 893–904, 2024.
Runjin Chen, Tong Zhao, Ajay Jaiswal, Neil Shah, and Zhangyang Wang. Llaga: Large language and graph assistant.
arXiv preprint arXiv:2402.08170, 2024a.
Yongji Wu, Defu Lian, Yiheng Xu, Le Wu, and Enhong Chen. Graph convolutional networks with markov random field
reasoning for social spammer detection. In Proceedings of the AAAI conference on artificial intelligence, volume 34,
pages 1054–1061, 2020.
Shuangli Li, Jingbo Zhou, Tong Xu, Liang Huang, Fan Wang, Haoyi Xiong, Weili Huang, Dejing Dou, and Hui Xiong.
Structure-aware interactive graph neural networks for the prediction of protein-ligand binding affinity. In Proceedings
of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pages 975–985, 2021.
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat
Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv
preprint arXiv:2303.12712, 2023.
Wenqi Fan, Yao Ma, Qing Li, Yuan He, Eric Zhao, Jiliang Tang, and Dawei Yin. Graph neural networks for social
recommendation. In The world wide web conference, pages 417–426, 2019.
Xu Yuan, Chengchuan Xu, Peng Li, and Zhikui Chen. Relational learning with hierarchical attention encoder and
recoding validator for few-shot knowledge graph completion. In Proceedings of the 37th ACM/SIGAPP Symposium
on Applied Computing, pages 786–794, 2022.
Nuo Chen, Yuhan Li, Jianheng Tang, and Jia Li. Graphwiz: An instruction-following language model for graph
problems. arXiv preprint arXiv:2402.16029, 2024b.
Zihan Luo, Xiran Song, Hong Huang, Jianxun Lian, Chenhao Zhang, Jinqi Jiang, Xing Xie, and Hai Jin. Graphin-
struct: Empowering large language models with graph understanding and reasoning capability. arXiv preprint
arXiv:2403.04483, 2024.
Bryan Perozzi, Bahare Fatemi, Dustin Zelle, Anton Tsitsulin, Mehran Kazemi, Rami Al-Rfou, and Jonathan Halcrow.
Let your graph do the talking: Encoding structured data for llms. arXiv preprint arXiv:2402.05862, 2024a.
Ruosong Ye, Caiqi Zhang, Runhui Wang, Shuyuan Xu, and Yongfeng Zhang. Natural language is all a graph needs.
arXiv preprint arXiv:2308.07134, 2023.
Zhikai Chen, Haitao Mao, Hongzhi Wen, Haoyu Han, Wei Jin, Haiyang Zhang, Hui Liu, and Jiliang Tang. Label-free
node classification on graphs with large language models (llms). arXiv preprint arXiv:2310.04668, 2023.
Baolong Bi, Shenghua Liu, Yiwei Wang, Lingrui Mei, and Xueqi Cheng. Lpnl: Scalable link prediction with large
language models. arXiv preprint arXiv:2401.13227, 2024.
Haiteng Zhao, Shengchao Liu, Ma Chang, Hannan Xu, Jie Fu, Zhihong Deng, Lingpeng Kong, and Qi Liu. Gimlet:
A unified graph-text model for instruction-based molecule zero-shot learning. Advances in Neural Information
Processing Systems, 36:5850–5887, 2023.
Chen Qian, Huayi Tang, Zhirui Yang, Hong Liang, and Yong Liu. Can large language models empower molecular
property prediction? arXiv preprint arXiv:2307.07443, 2023.
Bryan Perozzi, Bahare Fatemi, Dustin Zelle, Anton Tsitsulin, Mehran Kazemi, Rami Al-Rfou, and Jonathan Halcrow.
Let your graph do the talking: Encoding structured data for llms. arXiv preprint arXiv:2402.05862, 2024b.
Ziwei Chai, Tianjie Zhang, Liang Wu, Kaiqiao Han, Xiaohai Hu, Xuanwen Huang, and Yang Yang. Graphllm: Boosting
graph reasoning ability of large language model. arXiv preprint arXiv:2310.05845, 2023.
Yunxin Li, Baotian Hu, Haoyuan Shi, Wei Wang, Longyue Wang, and Min Zhang. Visiongraph: Leveraging large
multimodal models for graph theory problems in visual context. arXiv preprint arXiv:2405.04950, 2024.
Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Heung-Yeung Shum, and Jian
Guo. Think-on-graph: Deep and responsible reasoning of large language model with knowledge graph. arXiv
preprint arXiv:2307.07697, 2023.
Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. Reasoning on graphs: Faithful and interpretable large
language model reasoning. arXiv preprint arXiv:2310.01061, 2023.
Man Hung, Evelyn Lauren, Eric S Hon, Wendy C Birmingham, Julie Xu, Sharon Su, Shirley D Hon, Jungweon
Park, Peter Dang, and Martin S Lipsky. Social network analysis of covid-19 sentiments: Application of artificial
intelligence. Journal of medical Internet research, 22(8):e22590, 2020.
William Merrill and Ashish Sabharwal. The expressive power of transformers with chain of thought. arXiv preprint
arXiv:2310.07923, 2023a.
William Merrill and Ashish Sabharwal. The parallelism tradeoff: Limitations of log-precision transformers. Transactions
of the Association for Computational Linguistics, 11:531–545, 2023b.
Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna,
Chen-Yu Lee, and Tomas Pfister. Distilling step-by-step! outperforming larger language models with less training
data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023.
Liang Zhang, Katherine Jijo, Spurthi Setty, Eden Chung, Fatima Javid, Natan Vidra, and Tommy Clifford. Enhancing
large language model performance to answer questions and extract information more accurately. arXiv preprint
arXiv:2402.01722, 2024.
David Chiang, Peter Cholak, and Anand Pillay. Tighter bounds on the expressivity of transformer encoders. In
International Conference on Machine Learning, pages 5544–5562. PMLR, 2023a.
David JC MacKay. Information theory, inference and learning algorithms. Cambridge university press, 2003.
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao
Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
https://vicuna.lmsys.org (accessed 14 April 2023), 2023b.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil
Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,
2024.
A Proof of Theorem 1
Theorem 2. Given the following conditions:

1. Non-triviality: The reasoning path R provides non-trivial information about the responses Z, such that
H(R|Z) > 0.

2. Relevance: The reasoning path R contains information relevant to the correct answer a that is not fully
captured by the response Z, such that I(a; R|Z) > 0.

Then H(a|Z, R) < H(a|Z).
Intuition The main idea is that the reasoning path R provides additional information about the correct answer a
beyond what is provided by the response Z. Therefore, having access to R should reduce the uncertainty about a, hence
reducing the conditional entropy.
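Concretely, the claim follows from the standard chain-rule identity for conditional mutual information MacKay [2003]:

```latex
\begin{align*}
I(a; R \mid Z) &= H(a \mid Z) - H(a \mid Z, R)
  && \text{(definition of conditional mutual information)} \\
\Rightarrow\quad H(a \mid Z, R) &= H(a \mid Z) - I(a; R \mid Z) < H(a \mid Z)
  && \text{(by Relevance, } I(a; R \mid Z) > 0\text{)}.
\end{align*}
```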
Thus, this completes the proof that having access to the reasoning path R in addition to the responses Z reduces the
uncertainty about the correct answer a, proving H(a|Z, R) < H(a|Z).
B Details of training
Constructing Answers with Soft Paths (ASP) essentially involves a sentence rewriting task, an area in which LLMs
demonstrate substantial proficiency. For this purpose, researchers can utilize any commonly available open-source
or proprietary LLM. In this paper, we specifically employ GPT-4 to undertake this task, owing to its advanced text
generation and transformation capabilities. To facilitate ASP construction using GPT-4, we employ two commonly
used prompts. These prompts strategically guide GPT-4 to reformulate the initial Answer with Hard Path (AHP) into a
more varied and generalized form, enhancing the diversity and robustness of the model outputs.
Template1.
Given a question [question] and its answer [answer], derive the shortest yet complete Chain of Thought reasoning
path that leads to the answer.
[begin of question]
{question}
[end of question]
[begin of answer]
{answer}
[end of answer]
Template2.
Rewrite the following sentence to improve its logical flow:
{answer}
C Dataset
C.1 Dataset statistic.
Subset Connectivity Cycle Topo. Sort Shortest Path Max. Flow Matching Hamilton GNN
#Easy 296/56 125/25 151/29 151/29 124/26 249/51 125/25 77/23
Size n: 5-10 n: 5-10 n: 5-10 n: 5-10 n: 5-10 n: 6-20 n: 5-10 n: 5-8
#Medium 1004/196 493/107 372/78 / / / / /
Size n: 11-25 n: 11-25 n: 11-25 / / / / /
#Hard 561/119 341/59 152/28 165/35 168/32 177/33 167/33 124/16
Size n: 26-35 n: 26-35 n: 26-35 n: 11-20 n: 11-20 n: 17-33 n: 11-20 n: 9-15
Total 1861/371 959/191 675/135 316/64 292/58 426/84 292/58 201/39
Size denotes the number of nodes in graphs. A/B denotes the number of samples in the training/test set respectively.
• Connectivity: Given an undirected graph G and two nodes u and v, determine whether there is a path
$u \to v_i \to \cdots \to v_j \to v$ connecting them.

• Cycle: Given an undirected graph G, determine whether there is a cycle $u \to v_i \to \cdots \to v_j \to u$
in the graph.
• Shortest Path: Given a weighted undirected graph G and two nodes u and v, the shortest path is the one that
connects the two nodes and minimizes the sum of the weights of the edges, such that

$$P = \arg\min_{P \in \mathcal{P}_{uv}} \sum_{(v_i, v_j) \in P} w_{ij},$$

where $\mathcal{P}_{uv}$ denotes the set of all possible paths from u to v in the graph G, and $(v_i, v_j) \in P$ represents each
edge in the path P with weight $w_{ij}$.
• Topological Sort: A topological sort of a directed graph arranges its nodes in a sequence such that for every
directed edge (u, v) from node u to node v, node u precedes node v in the sequence. The goal is to determine a
valid topological order for a given directed graph, acknowledging that there may be multiple correct solutions.
We task LLMs with producing a valid topological sort for the given graph and then use an external program to
verify its accuracy (a verifier is sketched after this list).
• Maximum Flow: Consider a network G = {V, E} with two specific nodes s and t representing the source and
sink, respectively. Each edge in the network has an associated capacity c, which indicates the maximum flow
that can pass through it. The task is to instruct LLMs to devise a strategy to maximize the flow from the source
to the sink. The evaluation criteria include both exact matches with the optimal solution and partial credit. The
partial credit is calculated as follows:
$$PC = \begin{cases} \frac{t}{s}, & \text{if } t \le s, \\ 0, & \text{if } t > s, \end{cases}$$
where s represents the flow value in the optimal plan, and t is the flow value in the solution generated by the
LLMs.
• Bipartite Graph Matching: In an undirected graph G = {V, E}, a matching refers to a collection of edges
such that no two edges share a common node. A bipartite graph is a special type of graph where the set of
nodes can be partitioned into two disjoint sets U and V, with no edges between nodes within the same set.
The goal is to find a matching in the bipartite graph that includes the maximum possible number of edges. We
then use an external program to assess whether the solution generated by the LLMs is both valid and optimal.
• Hamilton Path: A Hamilton path in an undirected graph is a route that passes through each node exactly
one time. For a given undirected graph G = {V, E}, the objective is to identify a valid Hamilton path. We
preprocess the base graphs to guarantee the existence of at least one Hamilton path and then use an external
program to assess the solution provided by the LLM (see the sketch after this list).
• Graph Neural Networks: Given an undirected graph G = {V, E} and a two-dimensional embedding xi for
each node, the goal is to execute ℓ layers of message passing. This involves updating each node’s embedding
by summing the embeddings of all its neighboring nodes. Formally, this update is represented as:
$$x_i^{(\ell+1)} = \sum_{j \in \mathcal{N}_i} x_j^{(\ell)},$$
where $\mathcal{N}_i$ represents the set of neighbors of node i, and $(\ell)$ indicates the ℓ-th layer. The evaluation for this
task includes exact matches with the correct node embeddings and two types of partial credits. The first partial
credit is based on the percentage of nodes with correct embeddings (PC), while the second involves the average
of the relative errors across all embedding dimensions (RE). The relative error is calculated as
$RE = \frac{|x - y|}{\max(x, y)}$,
where x is the value generated by the LLMs, and y is the corresponding value in the standard answer, averaged
over all embedding dimensions. Minimal sketches of these checks and of the GNN ground truth are given after
this list.
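For concreteness, the sketch below gives minimal reference implementations of the topological-order check, the Hamilton-path check, and the ground-truth generator for the GNN task. These are our own illustrations, not the authors' evaluation scripts; graphs are assumed to be given as node lists, edge lists, and adjacency dictionaries with an entry (possibly empty) for every node.

```python
def is_valid_topological_order(order, nodes, edges):
    """Topological Sort check: `order` must contain every node exactly once,
    and every directed edge (u, v) must have u before v in the order."""
    if sorted(order) != sorted(nodes):
        return False
    pos = {node: i for i, node in enumerate(order)}
    return all(pos[u] < pos[v] for u, v in edges)

def is_valid_hamilton_path(path, nodes, edges):
    """Hamilton Path check: `path` must visit every node exactly once and
    each consecutive pair must be joined by an undirected edge."""
    if sorted(path) != sorted(nodes):
        return False
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset((a, b)) in edge_set for a, b in zip(path, path[1:]))

def simple_graph_convolution(adj, x, num_layers=2):
    """GNN task ground truth: run `num_layers` rounds of message passing in
    which each node's embedding becomes the sum of its neighbors' current
    embeddings, i.e. x_i^(l+1) = sum over j in N_i of x_j^(l) (no self-loop)."""
    dim = len(next(iter(x.values())))
    for _ in range(num_layers):
        x = {i: [sum(x[j][d] for j in adj[i]) for d in range(dim)] for i in x}
    return x

# Two-dimensional embeddings, as in the benchmark:
adj = {0: [2, 3], 1: [2], 2: [0, 1], 3: [0]}
x0 = {0: [1, 1], 1: [0, 0], 2: [0, 1], 3: [1, 0]}
print(simple_graph_convolution(adj, x0))
```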
D Experiment Setting
For the few-shot based prompt techniques (including Few-Shot, CoT, and CoT-SC), the input prompt contains K
examples of pertinent questions and answers; K is set to 4 for the Connectivity and Cycle tasks and to 5 for the
Shortest Path task. For the self-consistency (SC) prompt method, we sample 5 responses. All experiments are conducted on an
8×A800 machine. We employ Vicuna-7B and Llama3-8B as the base models for GUNDAM, the learning rate is set to
2e-5, and the batch size is set to 8. For evaluation, the temperature factor τ is set to 0.2 except for the self-consistency
Prompt method, whose τ is 0.8.
E Supplementary Experiment
We provide further insights with two case studies illustrated in Table 5. For the first case, consider the graph on
the left: all four models correctly indicated the presence of a path, confirming the connectivity between the two nodes.
However, Vicuna-7B made an error in its reasoning process by suggesting a non-existent connection from node
17 to node 16. GPT-3.5, GPT-4, and GUNDAM provided correct answers with different valid paths. Although these
paths were not the shortest, they were indeed existent and verified the connectivity as required. In the second case, both
GPT-4 and GUNDAM accurately concluded the absence of any connecting path, whereas GPT-3.5 and Vicuna-7B
generated incorrect responses. GPT-3.5 mistakenly introduced a non-existent edge (6,4), and Vicuna-7B incorrectly
proposed an edge (6,5). In analyzing other error cases of GPT-3.5, we also found that GPT-3.5 tends to construct
non-existent edges to make two nodes appear connected when no path exists in the graph.
Graph: [Two node-link diagrams from Table 5, omitted here; the left graph is used for the first query and the right graph for the second.]
Query: (left) Is there a path between node 15 and node 8? (right) Is there a path between node 0 and node 4?

GPT-3.5: (left) Yes, there is a path between node 15 and node 8 in the provided graph. The path is as follows: 15-1-0-8. (right) Yes, there is a path between node 0 and node 4 in the given graph. We can see that the following path connects these two nodes: 0-3-6-1-2-3-6-4.

GPT-4: (left) Yes, there is a path between node 15 and node 8. The path can be as follows: 15-3-17-8. (right) No, there is no path between node 0 and node 4 in the given graph. The nodes 0 and 4 are in separate disconnected subgraphs.

Vicuna-7B: (left) \***\ Starting from node 15, we can follow the connections to node 17, then to node 16, and finally to node 8. Therefore, there is a path between node 15 and node 8. (right) \***\ Starting from node 0, we can follow the edges as follows: (0,6) (6,3) (3,2) (2,1) (1,6) (6,5) (5,4). Therefore, the answer is Yes, there is a path between node 0 and node 4.

GUNDAM-V: (left) Node 15 is connected to node 1, node 1 is connected to node 0, node 0 is connected to node 8, we can follow the path: 15-1-0-8. Yes, there is a path between node 15 and node 8. (right) No, there is no path between node 0 and node 4.

\***\ indicates that unimportant generated content has been omitted.