AdaNS: Adaptive Negative Sampling for Unsupervised Graph Representation Learning

Yu Wang, Liang Hu, Wanfu Gao, Xiaofeng Cao, Yi Chang

Pattern Recognition 136 (2023) 109266

Article history: Received 8 February 2021; Revised 25 November 2022; Accepted 14 December 2022; Available online 15 December 2022.

Keywords: Graph representation learning; Negative sampling; Noise contrastive estimation

Abstract

Recently, unsupervised graph representation learning has attracted considerable attention by effectively encoding graph-structured data without semantic annotations. To accelerate its training, noise contrastive estimation (NCE) uniformly samples negative examples to fit an unnormalized graph model. However, this uniform sampling strategy can easily lead to slow convergence and even the vanishing gradient problem. In this paper, we theoretically show that sampling hard negatives close to the current anchor relieves these difficulties. With this finding, we propose an Adaptive Negative Sampling strategy, namely AdaNS, which efficiently samples hard negatives from a mixing distribution over the dimensional elements of the current node representation. Experiments show that our AdaNS sampling strategy, applied on top of representative unsupervised models such as DeepWalk and GraphSAGE, outperforms existing negative sampling strategies on node classification and visualization tasks. This further demonstrates that sampling hard negatives brings performance improvements for learning node representations.

1. Introduction

With the explosive generation of graph-structured data in the real world, graph representation learning methods that learn latent, informative, and low-dimensional representations for nodes have developed rapidly in recent years [1–3]. Among them, supervised learning methods have achieved desirable performance on several graph-related tasks by relying on extensive manual semantic annotations [2,4]. As it is impractical to obtain large-scale annotations, there are obstacles to scaling up supervised methods. In response, we have witnessed increasing interest in developing graph representation learning without semantic annotations, also known as unsupervised graph representation learning [3,5–7].

Recent unsupervised graph representation learning methods using contrastive objectives have achieved remarkable results [1,3]. Contrastive methods comprise a trainable encoder and a pair of samplers for positive and negative examples. Specifically, a contrastive method may employ a scoring function, training the encoder to increase the score on "real" input and decrease the score on "fake" input. Essentially, contrastive methods are developed as estimators that optimize the encoder to maximize the compatibility between the representations of positive examples and to push the representations of negative examples apart. However, it is impractical to optimize this directly unless the compatibility metric is unnormalized. Negative sampling [8], a simplified version of Noise Contrastive Estimation (NCE) [9,10], is a feasible strategy for approximate optimization, which simplifies the estimation of the normalized compatibility to a logistic regression problem that discriminates between positive and negative samples. The contrastive method with negative sampling is therefore developed as a discriminator. Based on whether the encoder in the discriminator employs a single-hidden-layer feed-forward neural network or a message-passing scheme, contrastive methods can be categorized into two groups: shallow network embedding methods [7,11,12] and Graph Neural Network (GNN) embedding methods [2,3,5]. Sampling robust and generic positive nodes has been the main focus for improving the performance of shallow network embedding methods in recent years [1,7,12], while the GNN line is dedicated to designing advanced encoders [2,5]. Another critical component, namely the negative sampling strategy, has not been sufficiently explored. We argue that the strategy by which we choose negative samples can significantly affect the quality of the representations.
For instance, distinguishing a pair of similar samples gives completely different feedback to the encoder than distinguishing a pair of absolutely unrelated samples. More fine-grained discrimination may benefit the expressiveness of the node representations.

Since negative sampling is derived from the simplification of NCE, the uniform sampling strategy is naturally inherited. However, the uniform sampling strategy suffers seriously from slow convergence and even the vanishing gradient problem, which prevents the model from achieving the desired representations [13]. To relieve the issues caused by the uniform sampling strategy, modeling adaptive negative sampling distributions based on the current training process has emerged as a solution for sampling high-quality negative examples [14–16]. In essence, there are several significant limitations. First, common adaptive sampling strategies based on generative adversarial network (GAN) architectures with extra generators introduce a myriad of parameters, which significantly limits their scaling to big graphs. Moreover, even if adaptive negative sampling may relieve the vanishing gradient problem, in practice its impact on the expressiveness of the yielded node representations in downstream tasks is unknown.

In this paper, we theoretically prove that adaptively sampling hard negatives close to the anchor can relieve the vanishing gradient problem and speed up model training. We then propose an efficient Adaptive Negative Sampling strategy, named AdaNS, which exploits the mixing probability of distributions with respect to dimensional elements to sample negative nodes efficiently. We apply the proposed AdaNS to representative unsupervised graph representation learning models, namely DeepWalk [7] and GraphSAGE [5], to sample negatives, yielding informative representations for downstream node classification and visualization tasks. Extensive experiments are conducted to evaluate the proposed negative sampling strategy. We experimentally find that sampling hard negatives is beneficial for improving the performance of the node representations on downstream tasks. The contributions of this work can be summarized as follows:

• We mathematically prove that sampling hard negatives close to the anchor may relieve the vanishing gradient problem and speed up model training.
• We design an efficient adaptive negative sampling strategy named AdaNS, which not only relieves the vanishing gradient problem but also reduces the computational cost.
• We conduct experiments on the DeepWalk and GraphSAGE models, applying the proposed AdaNS strategy to sample negatives and optimize the node representations for node classification and visualization tasks. The experimental results on seven real-world standard graph datasets show the superior performance of the proposed AdaNS.

The remainder of this paper is organized as follows: Section 2 surveys the related work. Section 3 covers notations and necessary preliminaries. In Section 4, we present theoretical insights on noise contrastive estimation and negative sampling. The proposed model is presented in Section 5. Section 6 reports the experimental results. Finally, Section 7 concludes this paper and suggests a future direction.

2. Related Work

The proposed work builds on a rich line of recent research on graph representation learning and negative sampling strategies. This section reviews related work in both areas.

2.1. Graph Representation Learning

A wide variety of graph representation learning models have been proposed in the past few years, falling into two categories: traditional shallow network embedding methods and graph neural networks (GNNs). Shallow network embedding methods attempt to optimize the node embeddings as parameters by minimizing a reconstruction error, and this type of model is devoted to studying the sampling strategy of positive node pairs. Initially proposed by Perozzi et al. [7], DeepWalk employs random walk sequences to explore the structural information in a graph and then learns node embeddings based on the skip-gram model [8] by maximizing the log-likelihood of the "context" nodes within a fixed window on the sequences for the given node. LINE [11] extends DeepWalk with both first- and second-order neighbor nodes sampled as positive node pairs and is the first to formally apply negative sampling to graph representation learning. After that, Node2vec [1] proposes a biased random walk strategy considering both depth-first and breadth-first search, based on which the sampling of positive node pairs can be flexibly adjusted for various tasks and graphs. SDNE [17] employs a deep autoencoder to capture the high non-linearity in graphs. AROPE [18] further captures arbitrary-order node relationships by performing eigendecomposition on the adjacency matrix. Furthermore, some models for sampling positive node pairs take advantage of other node properties of the graph, such as Personalized PageRank [19,20], diffusion patterns [21,22], structural roles [12,22], and adjacency matrices [23,24]. Rather than learning parametric embeddings, GNNs learn mappings from graph structure and node features into embeddings, which are trained end-to-end in a supervised or semi-supervised manner through neural network parametrization [2,25]. The original GCN algorithm [2] adopts a localized 1-step spectral convolution to design the message-passing layer for the semi-supervised classification task. GraphSAGE [5] extends GCN to the unsupervised setting by employing trainable aggregation functions and sampling positive and negative examples for a contrastive objective. Afterwards, DGI [6] trains the contrastive model by maximizing local mutual information between graph-level and node-level embeddings. Similarly, GRACE [26] maximizes the agreement of node embeddings generated by two different augmentations. Collectively, these unsupervised methods can be unified into the graph contrastive learning paradigm; a more comprehensive survey of graph representation learning is provided by [27].

2.2. Negative Sampling

Negative sampling was originally proposed to reduce the complexity of the softmax in word2vec [8]. From the perspective of the sampling scheme, existing negative sampling strategies can be categorized into two types: static negative sampling and adaptive sampling for hard negatives. Static negative sampling applies a predefined noise distribution to all nodes, such as degree-based sampling [8], uniform sampling [28], and WRMF [29]. Although the static sampling strategy is easy to implement, it suffers from the vanishing gradient and thereby cannot achieve the desired performance. The more informative the drawn negative samples are, the larger the magnitude of the gradient of the loss function becomes, allowing training to be accelerated. The adaptive sampling strategies are intended for this purpose. DNS [30] dynamically chooses negative samples from the ranked list produced by the current prediction scores. WARP [31] follows the assumption that the rating score of positive items should be higher than those of negative items, so that for a given positive item, it samples items uniformly until a negative item is found. PinSAGE [32] adds "hard
Table 1
Summary of the main notations.

G, V, E          Graph, node set, edge set
S                Sequence of nodes
d                Dimension of the learned node representation
F                Mapping function from nodes to d-dimensional representations
z_v              Representation (embedding) vector of node v
w, h             Realizations of the anchor and context nodes
θ                Model parameters
g_θ(·)           Encoder function
s_θ(·, ·)        Scoring function
(·)^⊤            Transpose of a vector
p(h|w)           Probability conditioned on w
P_d(·), P_n(·)   Data distribution and noise distribution
J                Objective function
σ(·)             Sigmoid function
sgn(·)           Sign of the scalar

denotes the embedding vector of the anchor node w. Generally, the encoder is implemented as an embedding lookup in the network embedding methods [1,7,11], while in the GNN methods it is a message-passing based neural network model [5,6]. With those node representations, a scoring function s_θ : V × V → R is employed to quantify the compatibility between the context nodes and the anchor node. For instance, given an anchor node w and its context node h, the score between them is denoted as s_θ(h, w). Generally, the scoring function is instantiated as the inner product of the node embeddings, e.g., s_θ(h, w) = g_θ(h)^⊤ g_θ(w) = z_h^⊤ z_w. In terms of modeling node representations, the conditional distribution corresponding to anchor w, P_θ(h|w), is defined as:

$$p_\theta(h \mid w) = \frac{\exp\big(s_\theta(h, w)\big)}{\sum_{h' \in V} \exp\big(s_\theta(h', w)\big)}. \tag{2}$$
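For concreteness, the short NumPy sketch below evaluates Eq. (2) under the inner-product scoring function; the array names are illustrative and not part of the original paper.

```python
import numpy as np

def conditional_distribution(Z, w):
    """Compute p_theta(h | w) of Eq. (2) for every candidate node h.

    Z : (|V|, d) array of node embeddings (row v is z_v).
    w : index of the anchor node.
    Returns a length-|V| vector of probabilities.
    """
    scores = Z @ Z[w]                     # s_theta(h, w) = z_h^T z_w for all h
    scores -= scores.max()                # numerical stability before exponentiation
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()  # normalize over the whole node set V
```

Note that the denominator sums over all |V| nodes; this full normalization is exactly what NCE and negative sampling are introduced to avoid.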
termed Negative sampling, or NEG [7,8,36]. Next, we briefly recall the objective of NEG:

$$\mathcal{J}^{w}_{\mathrm{NEG}} = \mathbb{E}_{h \sim P_d(h\mid w)}\big[\log \sigma(z_h^{\top} z_w)\big] + \sum_{j=1}^{k} \mathbb{E}_{h' \sim P_n(h')}\big[\log \sigma(-z_{h'}^{\top} z_w)\big], \tag{4}$$

where h ∼ P_d(h|w) denotes sampling context nodes h from the data distribution P_d(h|w), h' ∼ P_n(h') denotes sampling negative nodes h' from the noise distribution P_n(h'), and z_{h'}^⊤ z_w denotes the inner product of two node embeddings. Thus, the task is converted into distinguishing between the context nodes h and the negative samples h' drawn from the noise distribution P_n(h'). In particular, the coefficient k denotes sampling k negative nodes for each context node.

4.3. From NCE to NEG

In NCE, the negative nodes are sampled i.i.d. from the static marginal distribution P_n(h'). Indeed, NEG, as the simplification, makes the same assumption and further theoretically assumes that P_n(h') is a uniform distribution. The next theorem shows that NCE can be mathematically converted to NEG with a uniform sampling strategy.

Theorem 1. Let Eqs. (3) and (4) be the objective functions of NCE and NEG, respectively, where P_n(·) denotes the negative noise distribution, k denotes the scaling coefficient, and |V| is the size of the node set V. NEG is a particular case of NCE if k = |V| and P_n(·) is the uniform distribution.

The proof of Theorem 1 is given in Appendix A. According to Theorem 1, NEG is mathematically a special case of NCE with the uniform sampling strategy over the static marginal distribution.

However, uniformly sampling negatives from the static marginal distribution P_n(h') may not be the optimal manner of learning the node representations. For instance, prior work has shown the effectiveness of dynamic hard negatives in the information retrieval field [30,31,33]. In this work, we similarly explore sampling negative nodes from a conditional distribution on the current training state.

Suppose we define a noise distribution P_n(h'|w) conditioned on the representation of the anchor node w under the current training state. Then, we can rewrite the objective function of NEG as follows:

$$\mathcal{J}^{w}_{\mathrm{NEG}} = \mathbb{E}_{h \sim P_d(h\mid w)}\big[\log \sigma(z_h^{\top} z_w)\big] + \sum_{j=1}^{k} \mathbb{E}_{h' \sim P_n(h'\mid w)}\big[\log \sigma(-z_{h'}^{\top} z_w)\big]. \tag{5}$$

Next, to identify the importance of adaptive negative sampling on the conditional distribution, we show the optimization objective for node representation under NEG as follows:

Theorem 2. Let Eq. (5) be the objective function of NEG, where P_d(h|w) denotes the positive data distribution, P_n(h|w) denotes the negative noise distribution, and s_θ(h, w) is the scoring function. For each pair of anchor node w and context node h, the optimal representation vectors satisfy:

$$s_\theta(h, w) = -\log \frac{k\,P_n(h \mid w)}{P_d(h \mid w)}. \tag{6}$$

The proof of Theorem 2 is given in Appendix B. According to Theorem 2, we can clearly see that sampling negative examples conditioned on the node representations in the current training state has the same importance as sampling positive examples. Indeed, the adaptive conditional distribution may facilitate the optimization of the node representations significantly more than the static marginal distribution.

4.4. The Principle of the Negative Sampling Strategy

Having established the importance of adaptively sampling negative examples, the following question arises: how do we specify an effective adaptive negative sampling distribution? In response, we demonstrate from a stochastic gradient descent (SGD) optimization perspective that adaptively sampling hard negatives can solve the vanishing gradient problem and speed up convergence. In this context, hard negatives are samples with higher scores, i.e., samples located closer to the anchor in the embedding space. Intuitively, discriminating similar samples provides more information to the model than discriminating random samples, hence speeding up the optimization.

The gradient of the objective function of NEG is:

$$\frac{\partial \mathcal{J}_{\mathrm{NEG}}}{\partial \theta} = \big(\sigma(z_h^{\top} z_w) - 1\big)\,\frac{\partial\, z_h^{\top} z_w}{\partial \theta} + \sum_{h'}^{k} \sigma(z_{h'}^{\top} z_w)\,\frac{\partial\, z_{h'}^{\top} z_w}{\partial \theta}, \tag{7}$$

where θ denotes the parameters of the encoder model, e.g., the node embedding lookup.

Given a set of triples (w, h, h') uniformly sampled in a batch, the stochastic gradient descent step is performed as follows:

$$z_h^{\mathrm{new}} \leftarrow z_h^{\mathrm{old}} - \eta\,\big(\sigma(z_h^{\top} z_w) - 1\big)\,z_w, \qquad z_{h'}^{\mathrm{new}} \leftarrow z_{h'}^{\mathrm{old}} - \eta\,\sigma(z_{h'}^{\top} z_w)\,z_w, \tag{8}$$

where η is the learning rate. The representation learning model is optimized by looping over Eq. (8). Notice that the gradient magnitude for the embedding of the negative sample depends on its score (e.g., the inner product) with the embedding of the anchor, σ(z_{h'}^⊤ z_w). Obviously, if σ(z_{h'}^⊤ z_w) is close to 0, nothing can be learned from the sampled case (w, h') due to its vanishing gradient. Therefore, to alleviate the vanishing gradient problem and speed up the model updating, nodes with higher scores should be sampled with greater probability. This motivates sampling examples h' that are hard with respect to the anchor w. It is worth noting that the score depends on the current model parameters θ, and thus the negative sampling distribution is adaptive and dynamic during the learning process. Formally, given the anchor node w, we define the adaptive negative sampling distribution as follows:

$$P_n(h \mid w) = \frac{z_h^{\top} z_w}{\sum_{h' \in V} z_{h'}^{\top} z_w}. \tag{9}$$
k
5. AdaNS: A New Strategy
w
JNEG = Eh∼Pd (h|w ) [log σ (z
h zw )] + Eh ∼Pn (h |w ) [log σ (−z
h zw )].
j=1
With the theoretical findings above, we propose an efficient
(5) adaptive negative sampling strategy under the noise contrastive
estimation framework for unsupervised graph representation
Next, to identify the importance of adaptive negative sampling
learning.
on the conditional distribution, we show the optimization objective
for node representation on NEG as follows: 5.1. An Efficient Adaptive Negative Sampling Strategy
Theorem 2. Let Eq. (5) be the objective function of NEG, where
Pd (h|w ) denotes the positive data distribution and Pn (h|w ) denotes We deduce that sampled negative examples should be hard
the negative noise distribution, and sθ (h, w ) be the scoring function. negatives close to the anchor in Eq. (9), however, each sampling
For each pair of anchor node w and context node h, the optimal rep- involves calculating all examples and requires O(|V | · d ) time. To
resentation vectors satisfy: efficiently sample negative examples, in this work, we propose an
approximate negative sampling strategy that formalizes the nega-
kPn (h|w )
sθ (h, w ) = − log . (6) tive sampling distribution as a mixing distribution on dimensions.
Pd (h|w ) Firstly, let the inner products in the scoring function above be
a matrix factorization as follows:
Proof of Theorem 2 is given in Appendix B. According to
d
Theorem 2, we can clearly identify that sampling negative ex- z
h zw = zh , f zw , f , (10)
amples conditioned on the node representations in the current f =1
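Eq. (10) suggests treating the score as a mixture over the d dimensions, so that a negative can be drawn in two stages: first pick a dimension f with probability tied to the anchor element z_{w,f}, then pick a node h with probability tied to z_{h,f}. The sketch below is only one plausible reading of such a mixing distribution, not the authors' exact AdaNS sampler; in particular, the use of absolute values to handle signs and the per-draw normalization are assumptions made here.

```python
import numpy as np

def dimension_mixing_sample(Z, w, k=5, rng=np.random.default_rng(0)):
    """Draw k negatives from a dimension-wise mixing distribution (illustrative sketch).

    Stage 1: sample a dimension f with probability proportional to |z_{w,f}|.
    Stage 2: sample a node h with probability proportional to |z_{h,f}|.
    As written, each draw costs O(d + |V|) instead of the O(|V| * d) needed to
    score every node with Eq. (9).
    """
    dim_probs = np.abs(Z[w]) / np.abs(Z[w]).sum()
    negatives = []
    for _ in range(k):
        f = rng.choice(Z.shape[1], p=dim_probs)   # mixture component (dimension)
        col = np.abs(Z[:, f])
        negatives.append(rng.choice(Z.shape[0], p=col / col.sum()))
    return negatives
```

With per-dimension node tables precomputed once per epoch, the second stage can be made even cheaper, which is consistent with the efficiency behavior reported for AdaNS in Section 6.5.3.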
between nodes and their contexts. In practice, the scoring is implemented as an inner product of node embeddings, and the optimization objective is to maximize the score between a node and its contexts and to minimize the score between the node and its sampled negative nodes, that is, to be close to the context nodes and to push the negatives apart in the embedding space. Thus, the learned node embeddings preserve the information of the graph structure. Then, we evaluate negative sampling strategies by linear classification on the learned embeddings.

6.2.1. Baselines for DeepWalk

We include the following two groups of negative sampling strategies as baselines.

Static Negative Sampling Strategies

• RNS [28]: Random negative sampling (RNS) is a prevalent strategy that samples negative nodes from the uniform distribution.
• Degree-based Negative Sampling [8]: This strategy is widely used in the field of graph representation learning. It biases the uniform distribution towards the node-degree distribution raised to the 3/4 power.

Hard-based Negative Sampling Strategies

• DNS [30]: Dynamic negative sampling (DNS) is a state-of-the-art sampling strategy for collaborative filtering, which adaptively picks the negative item scored highest by the current recommender among a randomly sampled set of unobserved items.
• WARP [31]: The weighted approximate-rank pairwise (WARP) strategy adopts uniform sampling with rejection to draw informative negative samples, whose score should be larger than the positive one.
• KBGAN [33]: This model is an adversarial sampler, which uniformly samples Ns negative examples at random to calculate the probability of generating negative samples.
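Classification quality is measured with Micro-F1 and Macro-F1. A standard formulation of these metrics, assumed here and stated in terms of the symbols explained next, is:

$$\mathrm{Micro\text{-}F1} = \frac{2\sum_{i=1}^{z} TP_i}{2\sum_{i=1}^{z} TP_i + \sum_{i=1}^{z} FP_i + \sum_{i=1}^{z} FN_i},
\qquad
\mathrm{Macro\text{-}F1} = \frac{1}{z}\sum_{i=1}^{z}\frac{2\,TP_i}{2\,TP_i + FP_i + FN_i},$$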
where z denotes the number of labels, and T, F, P, and N denote True, False, Positive, and Negative, respectively. We repeat each trial 10 times and report the average scores under different training ratios.

6.2.3. Classification Results of DeepWalk

Tables 3–6 report the performance of AdaNS in comparison with the baseline strategies. Notably, the best results are shown in bold. From the results, we can draw the following observations and conclusions.

• For static samplers, the Degree-based negative sampling model achieves higher Micro-F1 and Macro-F1 results than RNS in most cases. This demonstrates that the Degree-based negative sampling strategy can alleviate the vanishing gradient problem caused by RNS, thereby improving the performance of the node representations.
• The adaptive samplers such as DNS, KBGAN, and WARP outperform the static samplers in most cases, suggesting that dynamically sampling hard negatives is better than using predefined static distributions. Among the baselines, DNS achieves performance second only to AdaNS on Cora, Wiki, and BlogCatalog, which shows that drawing the negative samples with the highest scores in a subset is beneficial for improving model performance. The performance achieved by KBGAN is unstable: it only surpasses RNS consistently and is inferior to the Degree-based model on the Wiki and PPI datasets. This may be because KBGAN is essentially equivalent to importance sampling within subsets, and thus its performance depends heavily on the uniform sampling of those subsets. WARP outperforms the RNS and Degree-based strategies in almost all cases and KBGAN in 75% of cases, but its performance is lower than that of DNS because its rejection sampling makes it difficult to sample qualifying nodes within the patience rounds once the model has been trained to a certain level, which hinders further performance improvement.
• AdaNS achieves more satisfactory performance than the baselines, especially on BlogCatalog, where AdaNS consistently outperforms all baselines regardless of training set ratios and metrics. This indicates that our proposed mixing-distribution sampling is more effective than the existing adaptive and static samplers.

6.3.1. Baselines for GraphSAGE

In addition to the strategies above, we include two semi-hard negative sampling strategies as baselines for GraphSAGE.

• InterCLR [34]: InterCLR presents a semi-hard negative sampling strategy, which first samples a pool containing the top 10% most similar examples and then randomly draws negatives from the pool.
• Ring [35]: Ring argues that the most similar examples might be better suited as positive examples rather than negative ones. Therefore, it chooses fairly similar examples, but not too hard ones, as negatives.
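Before turning to the result tables, here is a minimal sketch of the linear-classification evaluation protocol described above, assuming scikit-learn and a one-vs-rest logistic regression; the library and classifier are assumptions, not stated choices of the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

def evaluate_embeddings(Z, y, train_ratio=0.5, n_trials=10, seed=0):
    """Train a linear classifier on frozen embeddings and report average Micro/Macro-F1."""
    micro, macro = [], []
    for t in range(n_trials):
        Z_tr, Z_te, y_tr, y_te = train_test_split(
            Z, y, train_size=train_ratio, random_state=seed + t)
        clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(Z_tr, y_tr)
        pred = clf.predict(Z_te)
        micro.append(f1_score(y_te, pred, average="micro"))
        macro.append(f1_score(y_te, pred, average="macro"))
    return np.mean(micro), np.mean(macro)
```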
Table 3
Node classification results of DeepWalk on Cora.
Measure Strategies 10% 20% 30% 40% 50% 60% 70% 80% 90%
Degree 0. 6103 0. 6880 0. 7099 0. 7305 0. 7312 0. 7399 0. 7540 0. 7454 0. 7565
RNS 0. 6354 0. 6908 0. 7157 0. 7262 0. 7201 0. 7269 0. 7466 0. 7565 0. 7565
Micro-F 1 DNS 0. 6920 0. 7310 0. 7511 0. 7588 0. 7696 0. 7694 0. 7835 0. 7768 0. 8007
KBGAN 0. 6505 0. 7084 0. 7215 0. 7317 0. 7341 0. 7380 0. 7417 0. 7399 0. 7528
WARP 0. 6707 0. 7116 0. 7409 0. 7480 0. 7631 0. 7648 0. 7586 0. 7556 0. 7648
AdaNS 0. 6924 0. 7476 0. 7574 0. 7717 0.7792 0. 7749 0.7872 0.7970 0.8229
Measure Strategies 10% 20% 30% 40% 50% 60% 70% 80% 90%
Degree 0. 5756 0. 6736 0. 6904 0. 7091 0. 7119 0. 7306 0. 7372 0. 7432 0. 7329
RNS 0. 6185 0. 6828 0. 7056 0. 7135 0. 7104 0. 7093 0. 7326 0. 7377 0. 7091
Macro-F 1 DNS 0.6755 0. 7197 0. 7383 0. 7484 0. 7576 0. 7584 0. 7673 0. 7612 0. 7717
KBGAN 0. 6285 0. 6971 0. 7077 0. 7217 0. 7234 0. 7241 0. 7315 0. 7300 0. 7109
WARP 0. 6532 0. 7028 0. 7286 0. 7417 0. 7508 0. 7555 0. 7455 0. 7435 0. 7556
AdaNS 0. 6754 0.7368 0. 7471 0.7607 0.7663 0. 7644 0.7801 0.7795 0.7898
Table 4
Node classification results of DeepWalk on Wiki.
Measure Strategies 10% 20% 30% 40% 50% 60% 70% 80% 90%
Degree 0. 5630 0. 5962 0. 6105 0. 6334 0. 6509 0. 6632 0. 6648 0. 6694 0. 6349
RNS 0. 5303 0. 5998 0. 6188 0. 6202 0. 6467 0. 6414 0. 6371 0. 6528 0. 6390
Micro-F 1 DNS 0. 5557 0. 6107 0. 6366 0. 6542 0. 6717 0. 6663 0. 6787 0. 6861 0. 6390
KBGAN 0. 5644 0. 6081 0. 6229 0. 6286 0. 6517 0. 6486 0. 6537 0. 6549 0. 6100
WARP 0. 5636 0. 6072 0. 6257 0. 6324 0. 6585 0. 6662 0. 6804 0. 6825 0. 6115
AdaNS 0.5686 0.6201 0. 6449 0. 6694 0.6733 0. 6684 0. 6814 0.6882 0.6681
Measure Strategies 10% 20% 30% 40% 50% 60% 70% 80% 90%
Degree 0. 3975 0. 4740 0. 5084 0. 5105 0. 5322 0. 5261 0. 5634 0. 5730 0. 5275
RNS 0. 3847 0. 4559 0. 5043 0. 5179 0. 5349 0. 5264 0. 5381 0. 5326 0. 5341
Macro-F 1 DNS 0. 4206 0. 5010 0.5362 0. 5350 0. 5554 0. 5634 0. 5773 0. 5940 0. 5414
KBGAN 0. 4293 0. 4774 0. 5113 0. 5197 0. 5442 0. 5263 0. 5397 0. 5308 0. 5003
WARP 0. 4262 0. 4839 0. 5194 0. 5290 0. 5304 0. 5419 0. 5508 0. 5676 0. 5275
AdaNS 0. 4030 0. 4663 0. 5338 0.5659 0.5586 0. 5497 0.5836 0.5961 0. 5564
Table 5
Node classification results of DeepWalk on PPI.
Measure Strategies 10% 20% 30% 40% 50% 60% 70% 80% 90%
Degree 0.1668 0.1802 0.1976 0.1994 0. 2141 0. 2101 0. 2149 0. 2229 0. 2496
RNS 0.1595 0.1753 0.1882 0.1934 0. 2010 0. 2041 0. 2105 0. 2148 0. 2237
Micro-F 1 DNS 0.1691 0.1994 0.1966 0. 2012 0. 2108 0. 2214 0. 2221 0. 2226 0. 2237
KBGAN 0.1702 0.1791 0.1886 0.1904 0. 2043 0. 2082 0. 2193 0. 2185 0. 2323
WARP 0.1733 0.1951 0.1959 0.1977 0.1996 0. 2058 0. 2248 0. 2225 0. 2338
AdaNS 0.1803 0.1988 0.2085 0. 2197 0.2255 0. 2315 0.2379 0.2331 0. 2439
Measure Strategies 10% 20% 30% 40% 50% 60% 70% 80% 90%
Degree 0.1228 0.1396 0.1567 0.1633 0.1756 0.1750 0.1736 0.1755 0.1982
RNS 0.1207 0.1363 0.1464 0.1544 0.1637 0.1703 0.1739 0.1713 0.1905
Macro-F 1 DNS 0.1197 0.1393 0.1401 0.1455 0.1499 0.1554 0.1573 0.1579 0.1594
KBGAN 0.1269 0.1405 0.1493 0.1563 0.1707 0.1773 0.1762 0.1735 0.1808
WARP 0.1318 0.1345 0.1455 0.1476 0.1511 0.1564 0.1729 0.1747 0.1775
AdaNS 0.1340 0.1471 0.1607 0.1745 0.1754 0.1843 0.1849 0.1786 0.1901
Table 6
Node classification results of DeepWalk on BlogCatalog.
Measure Strategies 10% 20% 30% 40% 50% 60% 70% 80% 90%
Degree 0. 2911 0. 3246 0. 3427 0. 3557 0. 3635 0. 3661 0. 3754 0. 3825 0. 4018
RNS 0. 3026 0. 3289 0. 3483 0. 3556 0. 3589 0. 3628 0. 3680 0. 3781 0. 3762
Micro-F 1 DNS 0. 3494 0. 3743 0. 3845 0. 3888 0. 3876 0. 3971 0. 3981 0. 4071 0. 4118
KBGAN 0. 2911 0. 3261 0. 3470 0. 3599 0. 3612 0. 3720 0. 3790 0. 3911 0. 4031
WARP 0. 2906 0. 3220 0. 3471 0. 3532 0. 3578 0. 3802 0. 3896 0. 4015 0. 4053
AdaNS 0.3576 0. 3819 0.3880 0.3957 0.3986 0. 4075 0. 4167 0. 4279 0. 4280
Measure Strategies 10% 20% 30% 40% 50% 60% 70% 80% 90%
Degree 0.1677 0.1968 0. 2115 0. 2381 0. 2312 0. 2306 0. 2401 0. 2416 0. 2755
RNS 0.1702 0.1975 0. 2200 0. 2229 0. 2237 0. 2300 0. 2492 0. 2538 0. 2633
Macro-F 1 DNS 0.1869 0. 2193 0. 2322 0. 2409 0. 2393 0. 2517 0. 2518 0. 2701 0. 2847
KBGAN 0.1693 0.1999 0. 2138 0. 2221 0. 2248 0. 2323 0. 2394 0. 2595 0. 2793
WARP 0.1915 0.1988 0. 2242 0. 2408 0. 2382 0. 2383 0. 2470 0. 2638 0. 2848
AdaNS 0. 2017 0.2367 0. 2449 0. 2473 0. 2477 0. 2649 0.2766 0. 2913 0.2930
Table 7
The classification accuracy results of GraphSAGE with various negative sampling strategies.
Degree 0.815 0.75 0.768 0.694 0.613 0.615 0.821 0.755 0.778
RNS 0.799 0.747 0.749 0.69 0.592 0.639 0.817 0.724 0.758
DNS 0.779 0.754 0.739 0.674 0.589 0.588 0.815 0.745 0.751
KBGAN 0.797 0.759 0.744 0.664 0.584 0.589 0.822 0.755 0.767
InterCLR 0.819 0.769 0.762 0.689 0.61 0.621 0.822 0.756 0.788
Ring 0.817 0.756 0.775 0.69 0.628 0.65 0.824 0.77 0.784
AdaNS 0.833 0.78 0.779 0.704 0.633 0.653 0.826 0.76 0.792
6.3.2. Implementation Details for GraphSAGE

We implement the negative sampling strategies on top of the unsupervised GraphSAGE [5]. For GraphSAGE, the mean aggregator is used in our experiments. We set the dimension of the hidden layer to 64. In training, we use SGD with a learning rate of 0.01 and no weight decay for 100 epochs with a batch size of 256. For Ring, we set the upper percent to 90%. For the three citation network datasets, we use the default three splits, namely full, random, and public [41]. We evaluate the performance of the models by node classification accuracy.

6.3.3. Classification Results of GraphSAGE

We report the classification results of GraphSAGE with various negative sampling strategies in Table 7. The best results in each setting are shown in bold. From the results, we can obtain the following observations.

• For the two static negative sampling strategies, namely Degree and RNS, the results are lower than most semi-hard-based strategies but higher than the hard-based ones. Specifically, Degree outperforms RNS in eight of nine settings, which implies that the node-degree-based negative sampling strategy benefits node representation learning in the GNN model.
• We can observe that the two semi-hard-based strategies, namely InterCLR and Ring, are superior to the two hard-based strategies, namely DNS and KBGAN. This phenomenon is attributed to the fact that the message-passing scheme in GNNs enforces smoothness between neighboring nodes, making their representations more similar. Since GNNs follow the homophily assumption [42], i.e., neighboring nodes are more likely to belong to the same class, the most similar nodes, or the hardest nodes, are more likely to be positive examples, which conflicts with the hard-based negative sampling strategy. To verify this, we study the impact of the message-passing scheme in GNNs on the hard-based negative sampling strategies. Specifically, we take the node embeddings learned by DeepWalk and GraphSAGE after 50 epochs and then choose the 100 most similar nodes for each anchor, where the nodes with the same class as the anchor are regarded as positive samples. Finally, we calculate the frequency of positive samples for each node. As shown in Fig. 1, we plot the histograms of positive sample frequencies of DeepWalk and GraphSAGE on Cora. We can observe that the node embeddings learned by GraphSAGE based on the message-passing scheme significantly increase the frequency of positive samples among similar nodes compared to DeepWalk, which accordingly means that the negatives sampled by hard-based strategies are more likely to be false negatives. Therefore, appropriately relaxing the hardness, such as by using a semi-hard strategy, achieves superior classification performance.
• Our proposed strategy adaptively samples negatives from the mixing distribution, which enables it to sample hard negative examples. Moreover, our strategy focuses on only some of the dimensional elements, which can be considered a hardness relaxation, so our proposed strategy can also be viewed as a semi-hard sampling strategy. Such a property facilitates node classification in GNNs, and thus our proposed AdaNS achieves superior classification performance in eight of nine settings.

6.4. Graph Visualization

To evaluate the quality of node representations, visualization is the most common task. In the visualization task, we employ t-distributed stochastic neighbor embedding (t-SNE) [43], a nonlinear dimension reduction and visualization approach, to transform the node representations into a 2-dimensional space.

First, we aim to evaluate the impact of different negative sampling strategies on the discriminability of node representations. Specifically, we deploy different negative sampling strategies on top of the GraphSAGE model to learn node representations on the Cora dataset. Furthermore, to improve the quality of the visualization, we adopt a semi-supervised GraphSAGE in this experiment, which includes both a supervised loss and an unsupervised loss to train the model jointly. It is worth noting that, since InterCLR can theoretically be considered a special case of Ring [35] and the visualization of node representations learned by InterCLR is very similar to that learned by Ring, we exhibit them jointly in one view. The visualization results are shown in Fig. 2, where different colors denote nodes of different categories, and the symbol "x" denotes the cluster centroid of each category. The centroid takes into account the nodes that deviate from the cluster, thus measuring the discriminability of the nodes globally. Geometrically, a larger region enclosed by the cluster centroids indicates better discriminability. We plot the geometrically enclosed region of AdaNS in red and those of the baselines in black. We can observe that the area of the geometrically enclosed region generated by our proposed AdaNS exceeds all baselines. Numerically, we measure the discriminability of node representations via the mean distance between the pairwise cluster centroids, where a larger distance indicates that the node representations preserve better discriminability, and a smaller distance indicates lower discriminability, namely that nodes of different classes tend to be mixed. We annotate the numerical results of the centroid distance, and they show that the node representations generated by AdaNS have the largest value, which confirms the superiority of AdaNS.

Next, we study the sampling effects in practice using the various sampling strategies. Specifically, we randomly sample a batch of 100 nodes from the Cora dataset. Given the anchor node, we use each sampling strategy to draw 10 negative samples. A visualization of the sampling results in the embedding space is shown in Fig. 3. It is worth noting that in the embedding space, close nodes indicate similarity. We can find that both Degree and RNS statically sample negative nodes approximately at random within the whole batch. In contrast, DNS clearly tends to select the closest nodes, and KBGAN can be seen as a relaxed variant of DNS, which has the potential to select more distant nodes as negative samples, while Ring, InterCLR, and AdaNS tend to select nodes that are relatively close, but at a certain distance.
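A minimal sketch of the centroid-distance measurement used above, assuming scikit-learn's t-SNE (variable names are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

def mean_centroid_distance(Z, labels):
    """Project embeddings to 2-D with t-SNE and return the mean pairwise distance
    between per-class cluster centroids (larger means better discriminability)."""
    pts = TSNE(n_components=2, random_state=0).fit_transform(Z)
    centroids = np.stack([pts[labels == c].mean(axis=0) for c in np.unique(labels)])
    diffs = centroids[:, None, :] - centroids[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    upper = np.triu_indices(len(centroids), k=1)
    return dists[upper].mean()
```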
Fig. 1. Positive-sample histograms of models with DeepWalk or GraphSAGE as encoders on Cora.

Fig. 2. Visualization of node representations on the Cora dataset. Different colors denote different categories of nodes.

6.5. Study and Analysis

6.5.1. Classification Quality

Figure 4 presents the classification quality as a function of training epochs. We conduct experiments on top of DeepWalk and GraphSAGE on the Cora dataset, with {10%, 50%, 90%} training ratios and the three splits, respectively. As shown in Fig. 4(a), (c) and (e), the performance of DeepWalk with adaptive strategies such as DNS, WARP, KBGAN, and AdaNS increases drastically in the early training process compared to the static strategies. This demonstrates that sampling hard negatives improves both the effectiveness and efficiency of the model. Besides, we observe that DNS and WARP, after peaking in performance, begin to degrade as they are over-trained.
Fig. 3. Visualization of various negative sampling strategies in the embedding space. Given an anchor node (red dot), we draw negative samples (blue dots) with different strategies.

In contrast, our proposed AdaNS, after efficiently achieving optimal performance, remains in a stable state as training proceeds. As for GraphSAGE in Fig. 4(b), (d) and (f), due to the message-passing scheme, neighboring nodes that tend to belong to the same category naturally have high similarity, so most of the strategies achieve good accuracy at the early stage of training, where the adaptive strategies still show a higher ceiling. As training proceeds, the strategies based on static distributions, i.e., Degree and RNS, exhibit lackluster performance. In particular, DNS still declines sharply after reaching peak performance, even earlier than on DeepWalk, due to the smoothness induced by the message-passing scheme in GNNs. In contrast, the semi-hard-based strategies achieve more stable performance.

6.5.2. Training Loss

To investigate the effect of adaptive negative sampling on alleviating the vanishing gradient, we experimentally record the training loss as a function of training epochs for the various negative sampling strategies on the Cora dataset. As shown in Fig. 5, we can observe that the adaptive sampling strategies optimize the training loss better than the two static sampling strategies, Degree and RNS. Taking AdaNS as a benchmark, we can see the obvious fluctuation of DNS, which is due to its hard-based sampling strategy that mistakenly selects positive samples as negative samples during training. In contrast, KBGAN, InterCLR, and Ring exhibit more stable training losses. In particular, in Fig. 5(a), we can observe that the training loss of WARP decreases slowly, since its sampled negative nodes must score higher than the positive ones, which inevitably yields many false negatives. In contrast, DNS consistently samples the hardest negative nodes, resulting in a rapid and continuous decrease in training loss. In conclusion, adaptive negative sampling strategies are able to obtain lower training losses faster and more consistently, which demonstrates their capability to mitigate the vanishing gradient problem.

6.5.3. Efficiency Analysis

Adaptive sampling is time-consuming in comparison to static sampling while improving performance, hence the efficiency of sampling negatives is critical. The average running time per epoch for the adaptive strategies is summarized in Fig. 6. As the rejection mechanism of WARP increases in difficulty as training progresses, we adopt the average running time per epoch for a fair comparison. From the figure, we can find that the running time of AdaNS is the least, while WARP takes the most running time, due to its
rejection mechanism. Both DNS and KBGAN calculate the probability of the negative samples from a subset of candidates, with KBGAN being the relatively more efficient of the two. In short, the proposed adaptive sampling strategy AdaNS is satisfactory in terms of both performance and efficiency compared to other state-of-the-art strategies.

Fig. 6. The running time per epoch for different negative sampling strategies. Note that the unit of time for BlogCatalog is minutes, while the unit of time for the other datasets is seconds.

7. Conclusion and Discussions

Summary. In this paper, an adaptive negative sampling strategy, named AdaNS, for unsupervised graph representation learning is proposed. Different from existing strategies that sample negative nodes randomly, AdaNS adopts an efficient and effective
way to implement negative sampling by drawing hard negatives from the mixing distribution with respect to the dimensional elements of the node vectors. We conduct experiments on node classification and visualization tasks to evaluate the proposed strategy. The experimental results on seven benchmark datasets show that AdaNS is very competitive with state-of-the-art strategies.

Limitations of this work. There are several limitations in the theoretical analysis and the experimental justification. 1) To make the theoretical analysis more tractable, we make a few assumptions. We use the parametric embedding as an instance in our analysis, since the main focus is the effect of negative samples on the gradient update. While the experimental results empirically suggest that our analysis also holds for GNN models, a more formal investigation of deep neural network models would be valuable. 2) Our proposed method is only verified on benchmark graphs, while real-world scenarios contain more challenging settings, e.g., dynamic graphs and hypergraphs. Hence, it is crucial to devote more effort to studying more complicated graphs. We believe our findings establish a solid foundation for further research.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

Data will be made available on request.

Acknowledgement

This work is funded by: Postdoctoral Innovative Talents Support Program under Grant No. BX20190137, and China Postdoctoral Science Foundation funded project under Grant No. 2020M670839, and by the Fundamental Research Funds for the Central Universities, JLU No. 93K172020K36, and by Science Foundation of Jilin Province of China under Grant No. 2020122209JC.

Supplementary material

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.patcog.2022.109266.

References

[1] A. Grover, J. Leskovec, node2vec: Scalable feature learning for networks, in: ACM SIGKDD, 2016, pp. 855–864.
[2] T.N. Kipf, M. Welling, Semi-supervised classification with graph convolutional networks, ICLR, 2017.
[3] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, L. Wang, Graph contrastive learning with adaptive augmentation, in: WWW, 2021, pp. 2069–2080.
[4] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, K. Weinberger, Simplifying graph convolutional networks, in: International Conference on Machine Learning, PMLR, 2019, pp. 6861–6871.
[5] W.L. Hamilton, Z. Ying, J. Leskovec, Inductive representation learning on large graphs, in: NeurIPS, 2017, pp. 1024–1034.
[6] P. Velickovic, W. Fedus, W.L. Hamilton, P. Liò, Y. Bengio, R.D. Hjelm, Deep graph infomax, ICLR, 2019.
[7] B. Perozzi, R. Al-Rfou, S. Skiena, Deepwalk: online learning of social representations, in: ACM SIGKDD, 2014, pp. 701–710.
[8] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in: NeurIPS, 2013, pp. 3111–3119.
[9] M. Gutmann, A. Hyvärinen, Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, in: AISTATS, volume 9, 2010, pp. 297–304.
[10] A. Mnih, Y.W. Teh, A fast and simple algorithm for training neural probabilistic language models, ICML, 2012.
[11] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, Q. Mei, LINE: large-scale information network embedding, in: WWW, 2015, pp. 1067–1077.
[12] L.F.R. Ribeiro, P.H.P. Saverese, D.R. Figueiredo, struc2vec: Learning node representations from structural identity, in: ACM SIGKDD, 2017, pp. 385–394.
[13] P. Wang, S. Li, R. Pan, Incorporating GAN for negative sampling in knowledge representation learning, in: AAAI, 2018, pp. 2005–2012.
[14] J. Wang, L. Yu, W. Zhang, Y. Gong, Y. Xu, B. Wang, P. Zhang, D. Zhang, IRGAN: A minimax game for unifying generative and discriminative information retrieval models, in: SIGIR, 2017, pp. 515–524.
[15] H. Gao, H. Huang, Self-paced network embedding, in: Y. Guo, F. Farooq (Eds.), ACM SIGKDD, 2018, pp. 1406–1415.
[16] Z. Zhang, Y. Zeng, L. Bai, Y. Hu, M. Wu, S. Wang, E.R. Hancock, Spectral bounding: Strictly satisfying the 1-Lipschitz property for generative adversarial networks, Pattern Recognit. 105 (2020) 107179.
[17] D. Wang, P. Cui, W. Zhu, Structural deep network embedding, in: ACM SIGKDD, 2016, pp. 1225–1234.
[18] Z. Zhang, P. Cui, X. Wang, J. Pei, X. Yao, W. Zhu, Arbitrary-order proximity preserved network embedding, in: ACM SIGKDD, 2018, pp. 2778–2786.
[19] C. Zhou, Y. Liu, X. Liu, Z. Liu, J. Gao, Scalable graph embedding for asymmetric proximity, in: AAAI, 2017, pp. 2942–2948.
[20] A. Tsitsulin, D. Mottin, P. Karras, E. Müller, VERSE: versatile graph embeddings from similarity measures, in: WWW, 2018, pp. 539–548.
[21] Y. Shi, M. Lei, H. Yang, L. Niu, Diffusion network embedding, Pattern Recognit. 88 (2019) 518–531.
[22] C. Donnat, M. Zitnik, D. Hallac, J. Leskovec, Learning structural node embeddings via diffusion wavelets, in: ACM SIGKDD, 2018, pp. 1320–1329.
[23] C. Plant, S. Biedermann, C. Böhm, Data compression as a comprehensive framework for graph drawing and representation learning, in: ACM SIGKDD, 2020, pp. 1212–1222.
[24] C. Böhm, C. Plant, Massively parallel graph drawing and representation learning, in: 2020 IEEE International Conference on Big Data (Big Data), IEEE, 2020, pp. 609–616.
[25] Z. Zhang, D. Chen, Z. Wang, H. Li, L. Bai, E.R. Hancock, Depth-based subgraph convolutional auto-encoder for network representation learning, Pattern Recognit. 90 (2019) 363–376.
[26] Y. Zhu, Y. Xu, F. Yu, Q. Liu, S. Wu, L. Wang, Deep graph contrastive representation learning, GRL+@ICML, 2020.
[27] X. Liu, J. Tang, Network representation learning: A macro and micro view, AI Open 2 (2021) 43–64.
[28] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: UAI, 2009, pp. 452–461.
[29] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: ICDM, 2008, pp. 263–272.
[30] W. Zhang, T. Chen, J. Wang, Y. Yu, Optimizing top-n collaborative filtering via dynamic negative item sampling, in: SIGIR, 2013, pp. 785–788.
[31] T. Zhao, J.J. McAuley, I. King, Improving latent factor models via personalized feature projection for one class recommendation, in: CIKM, 2015, pp. 821–830.
[32] R. Ying, R. He, K. Chen, P. Eksombatchai, W.L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, in: ACM SIGKDD, 2018, pp. 974–983.
[33] L. Cai, W.Y. Wang, KBGAN: adversarial learning for knowledge graph embeddings, in: NAACL-HLT, 2018, pp. 1470–1480.
[34] J. Xie, X. Zhan, Z. Liu, Y.S. Ong, C.C. Loy, Delving into inter-image invariance for unsupervised visual representations, arXiv preprint arXiv:2008.11702 (2020).
[35] M. Wu, M. Mosse, C. Zhuang, D. Yamins, N.D. Goodman, Conditional negative sampling for contrastive learning of visual representations, ICLR, 2021.
[36] A. Mnih, K. Kavukcuoglu, Learning word embeddings efficiently with noise-contrastive estimation, in: NeurIPS, 2013, pp. 2265–2273.
[37] S. Prithviraj, G. Namata, M. Bilgic, L. Getoor, B. Gallagher, T. Eliassi-Rad, Collective classification in network data, AI Mag. 29 (3) (2008) 93–106.
[38] A. McCallum, K. Nigam, J. Rennie, K. Seymore, Automating the construction of internet portals with machine learning, Inf. Retr. 3 (2) (2000) 127–163.
[39] B.-J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M.S. Livstone, R. Oughtred, D.H. Lackner, J. Bähler, V. Wood, K. Dolinski, M. Tyers, The BioGRID interaction database: 2008 update, Nucleic Acids Res. 36 (Database-Issue) (2008) 637–640.
[40] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in: NeurIPS, 2012, pp. 1106–1114.
[41] J. Chen, T. Ma, C. Xiao, FastGCN: Fast learning with graph convolutional networks via importance sampling, ICLR, 2018.
[42] J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, D. Koutra, Beyond homophily in graph neural networks: Current limitations and effective designs, NeurIPS, 2020.
[43] L. Van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (11) (2008).

Yu Wang received the B.E. degree and the M.S. degree from Jilin University in 2015 and 2018, where he is currently pursuing the Ph.D. degree. His research interests include data mining and machine learning.

Liang Hu received his Ph.D. degree in the College of Computer Science from Jilin University in 1999. Currently, he is a professor at the College of Computer Science, Jilin University, China. His research areas are artificial intelligence and distributed computing.

Wanfu Gao received his M.S. and Ph.D. degrees in the College of Computer Science from Jilin University in 2016 and 2019. He is doing post-doctoral research in the College of Chemistry at Jilin University. His research interests include machine learning and feature selection.

Xiaofeng Cao received his Ph.D. degree at the Australian Artificial Intelligence Institute, University of Technology Sydney, Australia. He is currently an Associate Professor at the School of Artificial Intelligence, Jilin University, China, where he leads a Machine Perceptron Research Group with more than 15 Ph.D. and Master students. He has published more than 10 technical papers in top-tier journals and conferences, such as IEEE T-PAMI, IEEE TNNLS, IEEE T-CYB, CVPR, and IJCAI. His research interests include PAC learning theory, agnostic learning algorithms, generalization analysis, and hyperbolic geometry.

Yi Chang received the Ph.D. degree in computer science from the University of Southern California, Los Angeles, CA, USA, in 2016. He is currently the Dean of the School of Artificial Intelligence, Jilin University, Changchun, China. He has published more than 100 research papers in premium conferences or journals. He has broad research interests in information retrieval, data mining, machine learning, and natural language processing. Dr. Chang is an Associate Editor of the IEEE Transactions on Knowledge and Data Engineering.