(Update) Task2 AntGraph 1st
KEYWORDS
Link Prediction, Gradient Boosting Decision Trees, Graph Learning, WSDM Cup 2022

1 INTRODUCTION
As graphs ubiquitously exist in a wide range of real-world applications, many problems can be formulated as specific tasks over graphs. Link prediction [4], one of the most important tasks on graph-structured data, is widely applied in biology [10], recommendation [3, 14] and finance [12]. Meanwhile, real-world data usually evolves over time, and several recent works [11, 13] devise temporal graph learning models to uncover such temporal information. However, predicting links on a temporal graph is even more non-trivial. WSDM Cup 2022 calls for solutions that predict the probability of a link appearing within a given period of time. In this paper, we introduce the solution of the AntGraph team, which ranked first in the competition (achieving an AUC score of 0.666 on dataset A and 0.902 on dataset B). This technical report is organized as follows:
• First, we give some statistics on the datasets, perform exploratory analyses and introduce the motivation of our method. According to the data analyses, we surprisingly find that removing the time span information in prediction can still achieve satisfactory performance.
• Subsequently, we introduce the data processing flow and enumerate several feature engineering methods, ranging from network embedding to heuristic graph structure features.
• Finally, we conduct comprehensive experiments on the competition datasets, which show the effectiveness of our proposal; exhaustive ablation studies also show the improvement brought by each kind of feature.

Our source code is publicly available on GitHub1.

1 https://github.com/im0qianqian/WSDM2022TGP-AntGraph

2 DATASETS
In this section, we focus on the exploration of the datasets provided by the competition; an in-depth analysis is presented, followed by a detailed introduction of the evaluation metrics.
In summary, we detail all necessary statistics of the two datasets in Table 1.

Table 2: The analysis of the existence of the same edges in the initial test set.

Table 3: The performance w.r.t. AUC of our naive strategy compared to the baseline model provided by the sponsor on both the initial and intermediate (Inter.) test sets.

Table 4: Explore the maximum AUC without temporal information.

            Description                  Initial test (mode)   Initial test (mean)
Dataset A   node pair (w/o edge type)    0.9040                0.9776
            node pair (w/ edge type)     0.9900                0.9997
Dataset B   node pair (w/o edge type)    0.8946                0.9795
            node pair (w/ edge type)     0.9147                0.9875

The intermediate test set determines the final ranking of the competition, and the score of each participant is derived from the AUC as follows:
Tscore = (AUC − mean(AUC)) / std(AUC) × 0.1 + 0.5    (2)

AverageOfTscore = (Tscore_A + Tscore_B) / 2    (3)

where mean(AUC) and std(AUC) represent the mean and standard deviation of the AUC scores of all participants. Clearly, a larger average of T-scores means better performance.
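To make the ranking scheme concrete, here is a minimal sketch (ours, not from the competition toolkit; the participant AUC values are invented for illustration) of how Equations (2) and (3) can be computed:

    import numpy as np

    def t_score(auc: float, all_aucs: np.ndarray) -> float:
        # Equation (2): standardize a participant's AUC against the whole
        # field, then rescale so that an average participant scores 0.5.
        return (auc - all_aucs.mean()) / all_aucs.std() * 0.1 + 0.5

    # Hypothetical AUC values of all participants on each dataset.
    aucs_a = np.array([0.666, 0.640, 0.621, 0.603, 0.598])
    aucs_b = np.array([0.902, 0.910, 0.887, 0.875, 0.860])

    # Equation (3): the final score averages the per-dataset T-scores.
    final = (t_score(0.666, aucs_a) + t_score(0.902, aucs_b)) / 2
    print(f"AverageOfTscore = {final:.4f}")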
3 METHODOLOGY
In this section, we introduce our complete solution to the large-scale temporal graph link prediction task, which consists of a train data construction component, a feature engineering component and a downstream model training component. In the following, we zoom into each carefully designed component.
3.1 Train Data Construction
As mentioned above, the goal of this competition is to predict whether an edge will exist between two nodes within a given time span, whereas each edge in the provided graphs is only associated with a single timestamp. Hence, this inconsistency between training and testing severely threatens the generalization ability of models. In addition, the preceding data analysis concluded that this task may not benefit from involving timestamps; therefore, we construct the train data without timestamps as follows.
3.1.1 Negative sampling. For efficient training, we adopt a shuffling-based sampling strategy that draws negative instances within a batch rather than from the whole node set. Moreover, timestamps are ignored in our negative sampling process. In particular, the process is as follows: i) we treat the edges in the original graphs as the positive instance set, each instance consisting of a source node, a target node and a relation; ii) we keep the source nodes unchanged and randomly shuffle the target nodes and relations to generate the negative instance set; iii) we combine the positive and negative instance sets and uniformly sample a certain number of instances to construct the final train set.
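The following pandas sketch illustrates the three steps (a toy version under assumed column names, not the team's exact implementation; a production version would also filter out shuffled instances that collide with true positives):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)

    # i) Positive instances: the observed (source, relation, target) edges.
    pos = pd.DataFrame({"src": [0, 1, 2, 3],
                        "rel": [0, 1, 0, 2],
                        "dst": [1, 2, 3, 0]})
    pos["label"] = 1

    # ii) Keep sources unchanged; shuffle (relation, target) jointly to
    # obtain negatives (one possible reading of the shuffling strategy).
    neg = pos.copy()
    perm = rng.permutation(len(neg))
    neg[["rel", "dst"]] = neg[["rel", "dst"]].to_numpy()[perm]
    neg["label"] = 0

    # iii) Combine both sets and uniformly subsample the final train set.
    train = pd.concat([pos, neg]).sample(frac=0.5, random_state=42)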
3.1.2 Removing redundant features. First, we remove all time-related features, including the timestamp, start time and end time. Moreover, we remove the edge features of Dataset B, since these features are unavailable for most edges, i.e., the non-empty ratio is only 6.67%.
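A one-function sketch of this cleaning step, assuming hypothetical column names for the time-related fields and edge features:

    import pandas as pd

    def drop_redundant(edges: pd.DataFrame, is_dataset_b: bool) -> pd.DataFrame:
        # Hypothetical column names; the real datasets may name them differently.
        cols = [c for c in ("timestamp", "start_time", "end_time")
                if c in edges.columns]
        if is_dataset_b:  # edge features are empty for ~93% of edges in Dataset B
            cols += [c for c in edges.columns if c.startswith("edge_feat")]
        return edges.drop(columns=cols)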
3.2 Feature Engineering
3.2.1 LINE embedding. As concluded in the previous data analysis, the first-order relation is of crucial importance for our link prediction task. In order to capture such deep correlations between nodes in a more fine-grained manner, we introduce the LINE embedding [8], an effective and efficient graph learning framework for arbitrary graphs (undirected, directed, and/or weighted). In particular, LINE is carefully designed to preserve both the first-order and the second-order proximity, which suits our scenario of capturing co-occurrence relations. On the other hand, several heterogeneous [2] and knowledge [1, 7, 9] graph representation based methods are also promising ways of learning powerful representations, whereas LINE experimentally achieves the best performance, as shown in the experiment section.
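To convey the idea, the toy sketch below optimizes only the first-order LINE objective with one negative sample per edge (our solution relies on a full LINE implementation over the competition graphs, which also covers the second-order proximity):

    import numpy as np

    rng = np.random.default_rng(0)
    n_nodes, dim, lr = 100, 16, 0.025
    edges = [(int(rng.integers(n_nodes)), int(rng.integers(n_nodes)))
             for _ in range(500)]
    emb = rng.normal(scale=0.1, size=(n_nodes, dim))

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    for epoch in range(5):
        for u, v in edges:
            # Observed edge: pull e_u and e_v together (first-order proximity).
            eu, ev = emb[u].copy(), emb[v].copy()
            g = 1.0 - sigmoid(eu @ ev)
            emb[u] += lr * g * ev
            emb[v] += lr * g * eu
            # Negative sample: push e_u away from a random node w.
            w = int(rng.integers(n_nodes))
            ew = emb[w].copy()
            g = -sigmoid(eu @ ew)
            emb[u] += lr * g * ew
            emb[w] += lr * g * eu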
3.2.2 Node crossing features. After obtaining the representation of each node in the graphs, we construct crossing features to further reveal the correlation within each node pair. Specifically, we calculate the similarity of node pairs w.r.t. their LINE embeddings as the node crossing features. Given a node pair (u, v) with corresponding embeddings e_u and e_v, the similarity is calculated through the cosine operation (i.e., e_u · e_v / (||e_u|| × ||e_v||)) and the dot product (i.e., e_u · e_v).
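A minimal sketch of the two crossing features, assuming emb is the node embedding matrix produced in Section 3.2.1:

    import numpy as np

    def crossing_features(emb: np.ndarray, u: int, v: int) -> tuple[float, float]:
        eu, ev = emb[u], emb[v]
        dot = float(eu @ ev)                                   # dot product
        cos = dot / (np.linalg.norm(eu) * np.linalg.norm(ev))  # cosine similarity
        return dot, cos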
3.2.3 Subgraph features. In addition, we also add the following statistical features based on the graph structure, to help the downstream model capture high-order information (a counting sketch follows the list):
• Unitary features w.r.t. an individual node: i) the degree of the node; ii) the number of distinct nodes adjacent to the node; iii) the number of distinct edge types adjacent to the node.
• Binary features w.r.t. a node pair: i) the number of one-hop paths between the two nodes; ii) the number of two-hop paths between the two nodes; iii) the number of distinct edge types between the two nodes.
• Ternary features w.r.t. a node pair and an edge type: the number of occurrences of this triplet.
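The sketch below shows how all three groups of counts can be accumulated from a (source, edge type, target) edge list with plain dictionaries (illustrative only; it treats edges as directed, and the real pipeline must scale to the competition graphs):

    from collections import Counter, defaultdict

    edges = [(0, "a", 1), (0, "b", 1), (1, "a", 2), (0, "a", 2), (2, "b", 0)]

    adj = defaultdict(Counter)      # adj[u][v] = number of one-hop paths u -> v
    node_types = defaultdict(set)   # edge types incident to each node
    pair_types = defaultdict(set)   # edge types between each node pair
    triplet_cnt = Counter()         # occurrences of each (src, type, dst) triplet
    degree = Counter()

    for s, t, d in edges:
        adj[s][d] += 1
        degree[s] += 1
        node_types[s].add(t); node_types[d].add(t)
        pair_types[(s, d)].add(t)
        triplet_cnt[(s, t, d)] += 1

    def two_hop_paths(u, v):
        # Number of two-hop paths u -> w -> v.
        return sum(c * adj[w][v] for w, c in adj[u].items())

    # Unitary: degree, distinct neighbours, distinct incident edge types.
    print(degree[0], len(adj[0]), len(node_types[0]))
    # Binary: one-hop paths, two-hop paths, distinct edge types of the pair.
    print(adj[0][2], two_hop_paths(0, 2), len(pair_types[(0, 2)]))
    # Ternary: occurrences of a (src, type, dst) triplet.
    print(triplet_cnt[(0, "a", 1)])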
3.3 CatBoost Model
The link prediction task can easily be formulated as a binary classification problem over the features extracted from each (source node, relation, target node) triple. On the other hand, gradient boosting has proven its powerful capability in a wide range of classification applications. Recently, CatBoost [6] has gained increasing popularity due to its fast processing speed and high prediction performance. We feed the train data (see Section 3.1) as well as the abundant features (see Section 3.2) into the CatBoost model, and then use the produced scores as the final predictions.
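A minimal training sketch with the open-source catboost package (the hyperparameters are illustrative defaults, not the tuned values of our submission; X and y stand in for the features of Section 3.2 and the labels of Section 3.1):

    import numpy as np
    from catboost import CatBoostClassifier

    # Toy stand-ins for the real feature matrix and labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 12))
    y = rng.integers(0, 2, size=1000)

    model = CatBoostClassifier(iterations=500, learning_rate=0.1,
                               eval_metric="AUC", verbose=100)
    model.fit(X, y)
    scores = model.predict_proba(X)[:, 1]  # link probabilities = final predictions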
4 EXPERIMENTS
4.1 Overall Performance
Performance on the leaderboard. We present the results of the top five teams on the leaderboard in Table 6. We observe that our solution achieves the best performance on Dataset A and competitive performance on Dataset B. Achieving the best performance w.r.t. the final ranking metric further indicates that our solution works well on both kinds of data simultaneously.
Model                                       Dataset A                           Dataset B
                                            Initial AUC       Inter. AUC        Initial AUC       Inter. AUC
baseline                                    0.5110            0.5026            0.5100            0.5026
DeepWalk [5]                                0.5352            0.5707            0.5246            0.4985
TransE [1]                                  0.5182            0.5614            0.6389            0.8903
RotatE [7]                                  0.5315            0.5736            0.6323            0.8981
ComplEx [9]                                 0.5514            0.5821            0.6359            0.9014
LINE [8]                                    0.6072            0.6320            0.6425            0.8905
catboost (raw input data)                   0.6045            0.6222            0.5545            0.5869
+ LINE embedding                            0.6377 (+5.49%)   0.6540 (+5.11%)   0.6399 (+15.40%)  0.9013 (+53.57%)
+ Subgraph features                         0.6611 (+9.36%)   0.6673 (+7.25%)   0.5861 (+5.70%)   0.7561 (+28.83%)
+ LINE embedding + Subgraph features        0.6619 (+9.50%)   0.6659 (+7.02%)   0.6368 (+14.84%)  0.8978 (+52.97%)
+ LINE embedding + Node crossing features   0.6573 (+8.73%)   0.6673 (+7.25%)   0.6504 (+17.29%)  0.9001 (+53.37%)
+ All (submitted version)                   0.6657 (+10.12%)  0.6671 (+7.22%)   0.6459 (+16.48%)  0.9028 (+53.83%)
2 https://github.com/dglai/WSDM2022-Challenge