(KDD 2023) All in One - Multi-Task Prompting For Graph Neural Networks
Figure 1: Fine-tuning, Pre-training, and Prompting.
Figure 2: Our graph prompt inspired by the language prompt.
2 BACKGROUND
[Figure 2 panels: (a) NLP tasks: Question Answering, Sentiment Classification, Masked Prediction; (b) graph tasks: Graph-level, Subgraph-level, Edge-level, and Node-level Operations.]

Graph Neural Networks. Graph neural networks (GNNs) have demonstrated powerful expressiveness in many graph-based applications [10, 12, 15, 29]. The nature of most GNNs is to capture the underlying message-passing patterns for graph representation. To this end, many effective neural network architectures have been proposed, such as the graph attention network (GAT) [32], the graph convolutional network (GCN) [34], and the Graph Transformer [25]. Recent works also consider how to make graph learning more adaptive when data …
… encoders. When we treat the target node's label as this induced graph's label, we can easily translate the node classification problem into graph classification. Similarly, we present an induced graph for a pair of nodes in Figure 4b. Here, the pair of nodes can be treated as a positive edge if there is an edge connecting them, or a negative edge if not. This subgraph can be easily built by extending this node pair to their 𝜏-distance neighbors. We can then reformulate the edge-level task by assigning the graph label as the edge label of the target node pair. Note that for unweighted graphs, the 𝜏 distance is equal to the 𝜏-hop length; for weighted graphs, the 𝜏 distance refers to the shortest-path distance, and the induced graph can be easily found by many efficient algorithms [1, 39].
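As an illustration of how such an induced graph can be built, here is a minimal NetworkX sketch. The function name and the use of ego_graph are our own choices, and it assumes the unweighted case where the 𝜏 distance equals 𝜏 hops.

```python
import networkx as nx

def induced_graph_for_pair(G: nx.Graph, u, v, tau: int = 2) -> nx.Graph:
    """Collect all nodes within tau hops of either endpoint and return the
    subgraph they induce (unweighted case: tau-distance = tau hops)."""
    nodes = set()
    for seed in (u, v):
        nodes |= set(nx.ego_graph(G, seed, radius=tau).nodes)
    return G.subgraph(nodes).copy()

# The induced graph is labeled positive iff the original graph has the edge:
# y = int(G.has_edge(u, v))
```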
… where P denotes the set of prompt tokens and |P| is the number of tokens. Each token 𝑝𝑖 ∈ P can be represented by a token vector p𝑖 ∈ R^{1×𝑑} with the same size as the node features in the input graph. Note that in practice we usually have |P| ≪ 𝑁 and |P| ≪ 𝑑ℎ, where 𝑑ℎ is the size of the hidden layer in the pre-trained graph model. With these token vectors, the input graph can be reformulated by adding the 𝑗-th token to graph node 𝑣𝑖 (e.g., x̂𝑖 = x𝑖 + p𝑗). Then we replace the input features with the prompted features and send them to the pre-trained model for further processing.
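To make the token-insertion step concrete, below is a minimal PyTorch sketch under our own assumptions: the class and variable names are ours, and the softmax-weighted combination is just one possible inserting pattern (the simplest case being x̂𝑖 = x𝑖 + p𝑗, as stated above).

```python
import torch
import torch.nn as nn

class PromptTokens(nn.Module):
    """|P| learnable prompt tokens, each with the same dimension d as the
    input node features (a sketch, not the paper's implementation)."""
    def __init__(self, num_tokens: int, feat_dim: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, feat_dim) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Add a weighted combination of tokens to every node feature:
        # x_hat_i = x_i + sum_j w_ij * p_j
        w = torch.softmax(x @ self.tokens.t(), dim=1)   # (N, |P|)
        return x + w @ self.tokens                      # (N, d)

# x_prompted = PromptTokens(10, x.shape[1])(x)  # then feed to the frozen GNN
```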
3.3.3 Token Structures. S = {(𝑝𝑖, 𝑝𝑗) | 𝑝𝑖, 𝑝𝑗 ∈ P} is the token structure, denoted by pair-wise relations among tokens. Unlike the NLP prompt, the token structure in the prompt graph is usually implicit. To solve this problem, we propose three methods to design the prompt token structures: (1) the first way is to learn tunable parameters

A = ⋃_{𝑖=1}^{|P|−1} ⋃_{𝑗=𝑖+1}^{|P|} {𝑎𝑖𝑗}

…
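A hedged sketch of option (1): each unordered token pair (𝑖, 𝑗) gets a tunable score 𝑎𝑖𝑗, and an edge is kept in the prompt graph when its sigmoid exceeds a threshold. The class name and the thresholding rule are our assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class TokenStructure(nn.Module):
    """One tunable score a_ij per unordered token pair (i < j)."""
    def __init__(self, num_tokens: int, threshold: float = 0.5):
        super().__init__()
        self.num_tokens = num_tokens
        self.threshold = threshold
        n_pairs = num_tokens * (num_tokens - 1) // 2
        self.a = nn.Parameter(torch.zeros(n_pairs))

    def edge_index(self) -> torch.Tensor:
        # Upper-triangular pair indices (i, j) with i < j.
        i, j = torch.triu_indices(self.num_tokens, self.num_tokens, offset=1)
        keep = torch.sigmoid(self.a) > self.threshold
        return torch.stack([i[keep], j[keep]])  # (2, num_kept_edges)
```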
… the inserting pattern defined in section 3.3; G is the original graph and G𝑝 is the prompt graph. Then we can learn an optimal prompt graph G𝑝∗ to extend Equation (5) as follows:

𝜑∗(𝜓(G, G𝑝∗)) = 𝜑∗(g(A, X)) + 𝑂𝑝𝜑∗    (6)

By efficient tuning, the new error bound 𝑂𝑝𝜑∗ can be further reduced. In section 4.6, we empirically demonstrate that 𝑂𝑝𝜑∗ can be significantly smaller than 𝑂𝑝𝜑 via efficient training. That means our method supports more flexible transformations on graphs to match various pre-training strategies.

3.5.3 Efficiency. Assume an input graph has 𝑁 nodes and 𝑀 edges, and the prompt graph has 𝑛 tokens with 𝑚 edges. Let the graph model contain 𝐿 layers, and let the maximum dimension of one layer be 𝑑. The parameter complexity of the prompt graph is only 𝑂(𝑛𝑑). In contrast, some typical graph models (e.g., GAT [32]) usually contain 𝑂(𝐿𝐾𝑑² + 𝐿𝐾𝑑) parameters to generate node embeddings and an additional 𝑂(𝐾𝑑) parameters to obtain the whole-graph representation (𝐾 is the number of attention heads). The parameter count may be even larger in other graph neural networks (e.g., the Graph Transformer [37]). In our prompt learning framework, we only need to tune the prompt while keeping the pre-trained graph model frozen, making the training process converge faster than traditional transfer tuning.
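To put these counts in perspective, here is a back-of-the-envelope comparison that follows the complexity formulas above; the concrete values (L = 2, K = 4, d = 100, n = 10) are illustrative choices of ours, not reported measurements.

```python
def prompt_params(n_tokens: int, d: int) -> int:
    # O(n d): one d-dimensional vector per prompt token.
    return n_tokens * d

def gat_like_params(L: int, K: int, d: int) -> int:
    # O(L K d^2 + L K d) for the encoder plus O(K d) for the readout,
    # mirroring the counts quoted in the text (constants omitted).
    return L * K * d * d + L * K * d + K * d

print(prompt_params(10, 100))      # 1,000 tunable parameters
print(gat_like_params(2, 4, 100))  # 81,200 under the same rough count
```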
For the time complexity, a typical graph model (e.g., GCN [34]) usually needs 𝑂(𝐿𝑁𝑑² + 𝐿𝑀𝑑 + 𝑁𝑑) time to generate node embeddings via message passing and then obtain the whole-graph representation (e.g., 𝑂(𝑁𝑑) for summation pooling). By inserting the prompt into the original graph, the total time is 𝑂(𝐿(𝑛+𝑁)𝑑² + 𝐿(𝑚+𝑀)𝑑 + (𝑛+𝑁)𝑑). Compared with the original time, the additional time cost is only 𝑂(𝐿𝑛𝑑² + 𝐿𝑚𝑑 + 𝑛𝑑), where 𝑛 ≪ 𝑑, 𝑛 ≪ 𝑁, and 𝑚 ≪ 𝑀.

Besides the efficient parameter and time costs, our method is also memory friendly. Taking node classification as an example, the memory cost of a graph model largely consists of parameters, graph features, and graph structure information. As previously discussed, our method only needs to cache the prompt parameters, which are far fewer than those of the original graph model. For the graph features and structures, traditional methods usually need to feed the whole graph into the model, which requires a large amount of memory to cache these contents. In contrast, we only need to feed an induced graph to the model for each node, the size of which is usually far smaller than the original graph. Notice that in many real-world applications we are often interested in only a small fraction of the nodes, which means our method can stop early once there are no more nodes to predict, without propagating messages over the whole graph. This is particularly helpful for large-scale data.
3.5.4 Compatibility. Unlike GPPT, which can only use binary edge prediction as a pretext and is only applicable to node classification as the downstream task, our framework supports node-level, edge-level, and graph-level downstream tasks and can adopt various graph-level pretexts with only a few steps of tuning. Besides, when transferring the model to different tasks, traditional approaches usually need to additionally tune a task head. In contrast, our method focuses on manipulating the input data and relies less on the downstream task, which means we have a larger tolerance for the task head. For example, in section 4.3 we study the transferability from other domains or tasks but only adapt our prompt, leaving the source task head unchanged. We can even select a specific pretext and customize the details of our prompt without any tuned task head. Here we present a case that does not need to tune a task head, and we evaluate its feasibility in section 4.4.

Prompt without Task Head Tuning:
Pretext: GraphCL [36], a graph contrastive learning task that tries to maximize the agreement between a pair of views from the same graph.
Downstream Tasks: node/edge/graph classification.
Prompt Answer: node classification. Assume there are 𝑘 categories for the nodes. We design the prompt graph with 𝑘 sub-graphs (a.k.a. sub-prompts), where each sub-graph has 𝑛 tokens and corresponds to one node category. Then we can generate 𝑘 graph views for each input graph. We classify the target node with label ℓ (ℓ = 1, 2, · · · , 𝑘) if the ℓ-th graph view is closest to the induced graph. Edge and graph classification are handled similarly.
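The "closest graph view" rule above can be read as a nearest-neighbour decision in the embedding space of the frozen pre-trained model. A minimal sketch, with cosine similarity as our assumed notion of "closest" (the paper does not specify the distance):

```python
import torch
import torch.nn.functional as F

def classify_without_task_head(induced_emb: torch.Tensor,
                               view_embs: torch.Tensor) -> int:
    """induced_emb: (d,) embedding of the target node's induced graph.
    view_embs: (k, d) embeddings of the k prompted graph views, one per class.
    Returns the index of the view closest to the induced graph."""
    sims = F.cosine_similarity(view_embs, induced_emb.unsqueeze(0), dim=1)  # (k,)
    return int(torch.argmax(sims))
```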
Interestingly, by shrinking the prompt graph to isolated tokens aligned with node classes and replacing the induced graphs with the whole graph, our prompt format degenerates to GPPT, which means we can also leverage edge-level pretexts for node classification. Since this format is exactly the same as GPPT, we do not discuss it further. Instead, we directly compare our method with GPPT on node classification.

4 EVALUATION
In this section, we extensively evaluate our method against other approaches on node-level, edge-level, and graph-level tasks. In particular, we wish to answer the following research questions: Q1: How effective is our method under the few-shot learning setting for multiple graph tasks? Q2: How adaptable is our method when transferred to other domains or tasks? Q3: How do the main components of our method impact the performance? Q4: How efficient is our model compared with traditional approaches? Q5: How powerful is our method when we manipulate graphs?

4.1 Experimental Settings
4.1.1 Datasets. We compare our method with other approaches on five public datasets: Cora [34], CiteSeer [34], Reddit [8], Amazon [23], and Pubmed [34]. Detailed statistics are presented in Table 1, where the last column refers to the number of node classes. To conduct edge-level and graph-level tasks, we sample edges and subgraphs from the original data, where the label of an edge is decided by its two endpoints and the subgraph label follows the majority class of the subgraph nodes. For example, if nodes have 3 different classes, say 𝑐1, 𝑐2, 𝑐3, then the edge-level tasks contain at least 6 categories (𝑐1, 𝑐2, 𝑐3, 𝑐1𝑐2, 𝑐1𝑐3, 𝑐2𝑐3). We also evaluate additional graph classification and link prediction on more specialized datasets, where the graph label and the link label are inborn and not related to any node (see Appendix A).

Table 1: Statistics of datasets.
Dataset    #Nodes     #Edges        #Features   #Labels
Cora       2,708      5,429         1,433       7
CiteSeer   3,327      9,104         3,703       6
Reddit     232,965    23,213,838    602         41
Amazon     13,752     491,722       767         10
Pubmed     19,717     88,648        500         3
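For illustration, a small sketch of how edge and subgraph labels can be derived from node classes as described above (the helper names are ours; the exact sampling procedure is not shown here):

```python
from collections import Counter

def edge_label(y_u: int, y_v: int) -> tuple:
    # Unordered pair of endpoint classes; with 3 node classes this yields
    # at least 6 edge categories, as in the example above.
    return tuple(sorted((y_u, y_v)))

def subgraph_label(node_labels) -> int:
    # Majority class among the nodes of the sampled subgraph.
    return Counter(node_labels).most_common(1)[0][0]
```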
4.1.2 Approaches. Compared approaches come from three categories. (1) Supervised methods: these methods directly train a GNN model on a specific task and then directly infer the result. We take three well-known GNN models: GAT [32], GCN [34], and Graph Transformer [25] (GT for short). These GNN models are also included as the backbones of our prompt method. (2) Pre-training with fine-tuning: these methods first pre-train a GNN model in a self-supervised way, such as GraphCL [36] and SimGRACE [35]; the pre-trained model is then fine-tuned for a new downstream task. (3) Prompt methods: with a pre-trained model frozen and a learnable prompt graph, our prompt method aims to change the input graph and reformulate the downstream task to fit the pre-training strategies.

4.1.3 Implementations. We set the number of graph neural layers to 2 with a hidden dimension of 100. To study the transferability across different graph data, we use SVD (Singular Value Decomposition) to reduce the initial features to 100 dimensions. The token number of our prompt graph is set to 10. We also discuss the impact of the token number in section 4.4, where we change it from 1 to 20. We use the Adam optimizer for all approaches. The learning rate is set to 0.001 for most datasets. In the meta-learning stage, we randomly split all the node-level, edge-level, and graph-level tasks 1:1 into meta-training and meta-testing. Reported results are averaged over all tested tasks. More implementation details are shown in Appendix A, in which we also analyze the performance on more datasets and more kinds of tasks such as regression, link prediction, and so on.
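The SVD-based feature reduction can be done, for example, with scikit-learn's TruncatedSVD. A minimal sketch under our assumptions (per-dataset fitting, 100 components):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

def unify_features(X: np.ndarray, dim: int = 100) -> np.ndarray:
    """Reduce raw node features to a shared `dim`-dimensional space so that
    models can be transferred across datasets with different feature sizes."""
    svd = TruncatedSVD(n_components=dim, random_state=0)
    return svd.fit_transform(X)  # shape: (num_nodes, dim)
```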
4.2 Multi-Task Performance with Few-shot Learning Settings (RQ1)
We compared our prompt-based method with other mainstream training schemes on node-level, edge-level, and graph-level tasks under the few-shot setting. We repeat the evaluation 5 times and report the average results in Table 2, Table 12 (Appendix A), and Table 13 (Appendix A). From the results, we can observe that most supervised methods find it very hard to achieve better performance than pre-training and prompt methods. This is because the empirical annotations required by supervised frameworks are very limited in the few-shot setting, leading to poor performance. In contrast, pre-training approaches contain more prior knowledge, making the graph model rely less on data labels. However, to achieve better results on a specific task, we usually need to carefully select an appropriate pre-training approach and carefully tune the model to match the target task, and this huge effort is not guaranteed to carry over to other tasks. The gap between pre-training strategies and downstream tasks is still very large, making it very hard for the graph model to transfer knowledge in multi-task settings (we further discuss transferability in section 4.3). Compared with pre-training approaches, our solution further improves the compatibility of graph models. The reported improvements range from 1.10% to 8.81% on node-level tasks, 1.28% to 12.26% on edge-level tasks, and 0.14% to 10.77% on graph-level tasks. In particular, we also compare our node-level performance with the previously mentioned node-level prompt model GPPT in Table 2. Kindly note that our experimental settings are totally different from GPPT's. GPPT studies the few-shot problem by masking 30% or 50% of the data labels. In our paper, however, we propose a more challenging problem: how does the model perform if we further reduce the labeled data? So in our experiments, each class only has 100 labeled samples. This setting makes our labeled ratio approximately only 25% on Cora, 18% on CiteSeer, 1.7% on Reddit, 7.3% on Amazon, and 1.5% on Pubmed, which is far less than the 50% labeled ratio reported by GPPT.

4.3 Transferability Analysis (RQ2)
To evaluate the transferability, we compared our method with the hard transfer method and the fine-tuning method. Here the hard transfer method means we take the source task model that has the same task head as the target task and directly conduct model inference on the new task. The fine-tuning method means we load the source task model and then tune the task head for the new task. We evaluate the transferability from two perspectives: (1) how effectively is the model transferred to different tasks within the same domain? and (2) how effectively is the model transferred to different domains?

4.3.1 Transferability to Different Level Tasks. Here we pre-train the graph neural network on Amazon, then tune the model on two source tasks (graph level and node level), and further evaluate the performance on the target task (edge level). For simplicity, both the source tasks and the target task are built as binary classifications with 1:1 positive and negative samples (we randomly select a class as the positive label and sample negatives from the rest). We report the results in Table 3, from which we have two observations. First, our prompt method significantly outperforms the other approaches and the prediction results make sense. In contrast, the problem of the hard transfer method is that the source model sometimes cannot decide well on the target tasks because the target classes may be far away from the source classes. This may even cause negative transfer results (results that are lower than a random guess). In most cases, the fine-tuning method can output meaningful results with a few steps of tuning, but it can still encounter the negative transfer problem. Second, the graph-level task has better adaptability than the node-level task for the edge-level target, which is in line with our previous intuition presented in Figure 3 (section 3.2).

4.3.2 Transferability to Different Domains. We also tune the model on Amazon and PubMed as source domains, then load the model states from these source domains and report the performance on the target domain (Cora). Since different datasets have various input feature dimensions, we use SVD to unify the input features from all domains to 100 dimensions. Results are shown in Table 4, from which we can find that the good transferability of our prompt also holds when we deal with different domains.

4.4 Ablation Study (RQ3)
In this section, we compare our complete framework with four variants: "w/o meta" is the prompt method without the meta-learning
step; "w/o h" is our method without task head tuning, which was previously introduced in section 3.5.4; "w/o token structure" is the prompt where all the tokens are treated as isolated, without any inner connection; and "w/o inserting" is the prompt without …

Table 2: Node-level performance (%) with 100-shot setting. IMP (%): the average improvement of prompt over the rest.

Table 3: Transferability (%) on Amazon from different-level task spaces. Source tasks: graph-level tasks and node-level tasks. Target task: edge-level tasks.

[Figure 6: performance (y-axis, roughly 80 to 100%) vs. the number of prompt tokens (x-axis, 1 to 20) for the full prompt and the "w/o meta", "w/o h", "w/o token structure", and "w/o inserting" variants.]

4.5 Efficiency Analysis (RQ4)
Figure 6 presents the impact of an increasing token number on the model performance, from which we can find that most tasks can reach satisfactory performance with very limited tokens, making
the complexity of the prompt graph very small. The limited token numbers make our tunable parameter space far smaller than that of traditional methods, as can be seen in Table 5. This means our method can be efficiently trained with a few steps of tuning. As shown in Figure 7, the prompt-based method converges faster than traditional pre-train and supervised methods, which further suggests the efficiency advantages of our method.

Table 5: Tunable parameters comparison. RED (%): average reduction of the prompt method relative to the others.
Methods   Cora     CiteSeer   Reddit   Amazon   Pubmed   RED (%)
GAT       ∼155K    ∼382K      ∼75K     ∼88K     ∼61K     95.4↓
GCN       ∼154K    ∼381K      ∼75K     ∼88K     ∼61K     95.4↓
GT        ∼615K    ∼1.52M     ∼286K    ∼349K    ∼241K    98.8↓
prompt    ∼7K      ∼19K       ∼3K      ∼4K      ∼3K      –

Figure 7: Training losses with epochs (loss vs. epoch for supervised, pre-train, and prompt). Mean values and 65% confidence intervals over 5 repeats with different seeds.

Table 6: Error bound discussed in section 3.5.2. RED (%): average reduction of each method relative to the original error.
Naive Prompt (Equation 5)              1    0.8710   0.5241   2.0835   66.70↓
Our Prompt Graph (with token,          3    0.0875   0.2337   0.6542   90.66↓
structure, and inserting patterns)     5    0.0685   0.1513   0.4372   93.71↓
                                       10   0.0859   0.1144   0.2600   95.59↓
REFERENCES
[1] Takuya Akiba, Takanori Hayashi, Nozomi Nori, Yoichi Iwata, and Yuichi Yoshida. 2015. Efficient top-k shortest-path distance queries on large networks by pruned landmark labeling. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 29.
[2] Yunsheng Bai, Hao Ding, Yang Qiao, Agustin Marinovic, Ken Gu, Ting Chen, Yizhou Sun, and Wei Wang. 2019. Unsupervised inductive graph-level representation learning via graph-graph proximity. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. 1988–1994.
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[4] Hongxu Chen, Hongzhi Yin, Xiangguo Sun, Tong Chen, Bogdan Gabrys, and Katarzyna Musial. 2020. Multi-level graph convolutional networks for cross-platform anchor link prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1503–1511.
[5] Junru Chen, Yang Yang, Tao Yu, Yingying Fan, Xiaolong Mo, and Carl Yang. 2022. BrainNet: Epileptic Wave Detection from SEEG with Hierarchical Graph Diffusion Learning. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2741–2751.
[6] Taoran Fang, Yunchao Zhang, Yang Yang, and Chunping Wang. 2022. Prompt Tuning for Graph Neural Networks. arXiv preprint arXiv:2209.15240 (2022).
[7] Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making Pre-trained Language Models Better Few-shot Learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics. 3816–3830.
[8] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems 30 (2017).
[9] Bowen Hao, Hongzhi Yin, Jing Zhang, Cuiping Li, and Hong Chen. 2022. A Multi-Strategy based Pre-Training Method for Cold-Start Recommendation. ACM Transactions on Information Systems (2022).
[10] Zhenyu Hou, Xiao Liu, Yukuo Cen, Yuxiao Dong, Hongxia Yang, Chunjie Wang, and Jie Tang. 2022. GraphMAE: Self-Supervised Masked Graph Autoencoders. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 594–604.
[11] W Hu, B Liu, J Gomes, M Zitnik, P Liang, V Pande, and J Leskovec. 2020. Strategies For Pre-training Graph Neural Networks. In International Conference on Learning Representations (ICLR).
[12] Cheng Jiashun, Li Man, Li Jia, and Fugee Tsung. 2023. Wiener Graph Deconvolutional Network Improves Graph Self-Supervised Learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
[13] Wei Jin, Tyler Derr, Haochen Liu, Yiqi Wang, Suhang Wang, Zitao Liu, and Jiliang Tang. 2020. Self-supervised learning on graphs: Deep insights and new direction. arXiv preprint arXiv:2006.10141 (2020).
[14] Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 3045–3059.
[15] Jia Li, Zhichao Han, Hong Cheng, Jiao Su, Pengyun Wang, Jianfeng Zhang, and Lujia Pan. 2019. Predicting path failure in time-evolving graphs. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1279–1289.
[16] Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics. 4582–4597.
[17] Yan Ling, Jianfei Yu, and Rui Xia. 2022. Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment Analysis. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2149–2159.
[18] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2021. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586 (2021).
[19] Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2022. P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 61–68.
[20] Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heinz, and Dan Roth. 2021. Recent advances in natural language processing via large pre-trained language models: A survey. arXiv preprint arXiv:2111.01243 (2021).
[21] Yujia Qin, Xiaozhi Wang, Yusheng Su, Yankai Lin, Ning Ding, Zhiyuan Liu, Juanzi Li, Lei Hou, Peng Li, Maosong Sun, et al. 2021. Exploring low-dimensional intrinsic task subspace via prompt tuning. arXiv preprint arXiv:2110.07867 (2021).
[22] Anna Rogers, Olga Kovaleva, Matthew Downey, and Anna Rumshisky. 2020. Getting closer to AI complete question answering: A set of prerequisite real tasks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 8722–8731.
[23] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. 2018. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868 (2018).
[24] Zheyan Shen, Jiashuo Liu, Yue He, Xingxuan Zhang, Renzhe Xu, Han Yu, and Peng Cui. 2021. Towards out-of-distribution generalization: A survey. arXiv preprint arXiv:2108.13624 (2021).
[25] Yunsheng Shi, Zhengjie Huang, Shikun Feng, Hui Zhong, Wenjin Wang, and Yu Sun. 2020. Masked label prediction: Unified message passing model for semi-supervised classification. arXiv preprint arXiv:2009.03509 (2020).
[26] Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Empirical Methods in Natural Language Processing (EMNLP).
[27] Mingchen Sun, Kaixiong Zhou, Xin He, Ying Wang, and Xin Wang. 2022. GPPT: Graph pre-training and prompt tuning to generalize graph neural networks. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 1717–1727.
[28] Xiangguo Sun, Hong Cheng, Bo Liu, Jia Li, Hongyang Chen, Guandong Xu, and Hongzhi Yin. 2023. Self-supervised Hypergraph Representation Learning for Sociological Analysis. IEEE Transactions on Knowledge and Data Engineering (2023).
[29] Xiangguo Sun, Hongzhi Yin, Bo Liu, Hongxu Chen, Qing Meng, Wang Han, and Jiuxin Cao. 2021. Multi-level hyperedge distillation for social linking prediction on sparsely observed networks. In Proceedings of the Web Conference 2021. 2934–2945.
[30] Xiangguo Sun, Hongzhi Yin, Bo Liu, Qing Meng, Jiuxin Cao, Alexander Zhou, and Hongxu Chen. 2022. Structure Learning Via Meta-Hyperedge for Dynamic Rumor Detection. IEEE Transactions on Knowledge and Data Engineering (2022).
[31] Jianheng Tang, Jiajin Li, Ziqi Gao, and Jia Li. 2022. Rethinking Graph Neural Networks for Anomaly Detection. In International Conference on Machine Learning.
[32] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph Attention Networks. In International Conference on Learning Representations.
[33] Liyuan Wang, Mingtian Zhang, Zhongfan Jia, Qian Li, Chenglong Bao, Kaisheng Ma, Jun Zhu, and Yi Zhong. 2021. AFEC: Active forgetting of negative transfer in continual learning. Advances in Neural Information Processing Systems 34 (2021), 22379–22391.
[34] Max Welling and Thomas N Kipf. 2016. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR 2017).
[35] Jun Xia, Lirong Wu, Jintao Chen, Bozhen Hu, and Stan Z Li. 2022. SimGRACE: A Simple Framework for Graph Contrastive Learning without Data Augmentation. In Proceedings of the ACM Web Conference 2022. 1070–1079.
[36] Yuning You, Tianlong Chen, Yongduo Sui, Ting Chen, Zhangyang Wang, and Yang Shen. 2020. Graph contrastive learning with augmentations. Advances in Neural Information Processing Systems 33 (2020), 5812–5823.
[37] Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, and Hyunwoo J Kim. 2019. Graph transformer networks. Advances in Neural Information Processing Systems 32 (2019).
[38] Zexuan Zhong, Dan Friedman, and Danqi Chen. 2021. Factual Probing Is [MASK]: Learning vs. Learning to Recall. In NAACL-HLT. 5017–5033.
[39] Andy Diwen Zhu, Xiaokui Xiao, Sibo Wang, and Wenqing Lin. 2013. Efficient single-source shortest path and distance queries on large graphs. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 998–1006.
[40] Yanqiao Zhu, Yichen Xu, Feng Yu, Qiang Liu, Shu Wu, and Liang Wang. 2021. Graph contrastive learning with adaptive augmentation. In Proceedings of the Web Conference 2021. 2069–2080.
A APPENDIX
In this section, we supplement more experiments to further evaluate the effectiveness of our framework. The source code is publicly available at https://anonymous.4open.science/r/mpg

Additional Datasets. Besides the datasets mentioned in the main experiments of our paper, we supplement more datasets in Table 7 to further evaluate the effectiveness of our framework. Specifically, ENZYMES and ProteinsFull are two molecule/protein datasets that are used in our additional graph-level classification tasks. Movielens and QM9 are used to evaluate the performance of our method on edge-level and graph-level regression, respectively. In particular, Movielens contains users' rating scores for movies, where each edge has a score value ranging from 0 to 5. QM9 is a molecule graph dataset where each graph has 19 regression targets, which are treated as graph-level multi-output regression. The PersonalityCafe and Facebook datasets are used to test the performance of link prediction; both are social networks where edges denote following/quoting relations.

Table 7: Statistics of Additional Datasets
Dataset          #Nodes      #Edges      #Features   #Labels   #Graphs
ENZYMES          19,580      74,564      21          6         600
ProteinsFull     43,471      162,088     32          2         1,113
Movielens        10,352      100,836     100         -         1
QM9              2,333,625   4,823,498   16          -         129,433
PersonalityCafe  100,340     3,788,032   100         0         1
Facebook         4,039       88,234      1,283       0         1

Multi-label vs. Multi-class Classification. In the main experiments, we treat the classification task as a multi-label problem. Here we present the experimental results under a multi-class setting. As reported in Table 8, our prompt-based method still outperforms the rest of the methods.

Table 8: Multi-class node classification (100-shots)
Methods                    Cora                      CiteSeer
                           Acc (%)   Macro F1 (%)    Acc (%)   Macro F1 (%)
Supervised                 74.11     73.26           77.33     77.64
Pre-train and Fine-tune    77.97     77.63           79.67     79.83
Prompt                     80.12     79.75           80.50     80.65
Prompt w/o h               78.55     78.18           80.00     80.05
Reported Acc of GPPT
(Label Ratio 50%)          77.16     -               65.81     -

Additional Graph-level Classification. Here, we evaluate the graph-level classification performance where the graph label is not impacted by the nodes' attributes. As shown in Table 9, our method is more effective in multi-class graph classification, especially in the few-shot setting.

Table 9: Additional graph-level classification.
Methods                ProteinsFull (100 shots)        ENZYMES (50 shots)
                       Acc (%)   Macro F1 (%)          Acc (%)   Macro F1 (%)
Supervised             66.64     65.03                 31.33     30.25
Pre-train + Fine-tune  66.50     66.43                 34.67     33.94
Prompt                 70.50     70.17                 35.00     34.92
Prompt w/o h           68.50     68.50                 36.67     34.05

Edge/Graph-level Regression. Beyond classification tasks, our method can also help improve graph models on regression tasks. Here, we evaluate the regression performance on a graph-level (QM9) and an edge-level (MovieLens) dataset using MAE (mean absolute error) and MSE (mean squared error). We only feed 100-shot edge-induced graphs to the model; the results are shown in Table 10, from which we can observe that our prompt-based methods outperform traditional approaches.

Table 10: Graph/edge-level regression with few-shot settings.
Tasks                  Graph Regression            Edge Regression
Datasets               QM9 (100 shots)             MovieLens (100 shots)
Methods                MAE       MSE               MAE       MSE
Supervised             0.3006    0.1327            0.2285    0.0895
Pre-train + Fine-tune  0.1539    0.0351            0.2171    0.0774
Prompt                 0.1384    0.0295            0.1949    0.0620
Prompt w/o h           0.1424    0.0341            0.2120    0.0744

Link Prediction. Beyond edge classification, link prediction is also a widely studied problem in the graph learning area. Here, the edges are split into three parts: (1) 80% of the edges are used for message passing only; (2) 10% of the remaining edges form the supervised training set; and (3) the rest form the testing set. For each edge in the training set and the testing set, we treat these edges as positive samples and sample non-adjacent nodes as negative samples. We generate the edge-induced graph for these node pairs according to the first-part edges. The graph label is assigned as positive if the node pair has a positive edge, and vice versa. To further evaluate our method's potential in the extremely limited setting, we only sample 100 positive edges from the training set to train our model. In the testing stage, each positive edge is paired with 100 negative edges. We evaluate the performance by MRR (mean reciprocal rank) and Hit Ratio@1, 5, 10. Results in Table 11 demonstrate that the performance of our prompt-based method remains the best in most cases.
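For reference, a minimal sketch of how MRR and Hit@k can be computed when each positive edge is ranked against its 100 sampled negatives (this is our own helper, not the paper's evaluation code):

```python
import torch

def mrr_and_hits(pos_score: torch.Tensor, neg_scores: torch.Tensor,
                 ks=(1, 5, 10)):
    """pos_score: scalar score of the positive edge.
    neg_scores: (100,) scores of its sampled negative edges."""
    rank = 1 + int((neg_scores > pos_score).sum())  # 1-based rank of the positive
    mrr = 1.0 / rank
    hits = {k: float(rank <= k) for k in ks}
    return mrr, hits
```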
Table 11: Evaluation on link prediction (100-shot settings)
Datasets               PersonalityCafe                     Facebook
Methods                MRR    Hit@1   Hit@5   Hit@10       MRR    Hit@1   Hit@5   Hit@10
Supervised             0.18   0.04    0.24    0.56         0.13   0.06    0.17    0.35
Pre-train + Fine-tune  0.13   0.05    0.12    0.34         0.10   0.02    0.16    0.33
Prompt                 0.20   0.07    0.32    0.60         0.19   0.10    0.23    0.39
Prompt w/o h           0.20   0.06    0.30    0.50         0.15   0.09    0.15    0.33
Label Ratio            ∼0.003% (training), ∼80% (message passing)    ∼0.1% (training), ∼80% (message passing)
Table 12: Edge-level performance (%) with 100-shot setting. IMP (%): the average improvement of prompt over the rest.
Table 13: Graph-level performance (%) with 100-shot setting. IMP (%): the average improvement of prompt over the rest.