
Revisiting Semi-Supervised Learning with Graph Embeddings

Zhilin Yang (zhiliny@cs.cmu.edu)
William W. Cohen (wcohen@cs.cmu.edu)
Ruslan Salakhutdinov (rsalakhu@cs.cmu.edu)
School of Computer Science, Carnegie Mellon University

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

We present a semi-supervised learning framework based on graph embeddings. Given a graph between instances, we train an embedding for each instance to jointly predict the class label and the neighborhood context in the graph. We develop both transductive and inductive variants of our method. In the transductive variant, the class labels are determined by both the learned embeddings and the input feature vectors, while in the inductive variant the embeddings are defined as a parametric function of the feature vectors, so predictions can be made on instances not seen during training. On a large and diverse set of benchmark tasks, including text classification, distantly supervised entity extraction, and entity classification, we show improved performance over many of the existing models.

1. Introduction

Semi-supervised learning aims to leverage unlabeled data to improve performance. A large number of semi-supervised learning algorithms jointly optimize two training objectives: the supervised loss over labeled data and the unsupervised loss over both labeled and unlabeled data. Graph-based semi-supervised learning defines the loss function as a weighted sum of the supervised loss over labeled instances and a graph Laplacian regularization term (Zhu et al., 2003; Zhou et al., 2004; Belkin et al., 2006; Weston et al., 2012). The graph Laplacian regularization is based on the assumption that nearby nodes in a graph are likely to have the same labels, and it is effective because it constrains the labels to be consistent with the graph structure.

Recently developed unsupervised representation learning methods learn embeddings that predict a distributional context, e.g., a word embedding might predict nearby context words (Mikolov et al., 2013; Pennington et al., 2014), or a node embedding might predict nearby nodes in a graph (Perozzi et al., 2014; Tang et al., 2015). Embeddings trained with distributional context can be used to boost the performance of related tasks. For example, word embeddings trained from a language model can be applied to part-of-speech tagging, chunking and named entity recognition (Collobert et al., 2011; Yang et al., 2016).

In this paper we consider not word embeddings but graph embeddings. Existing results show that graph embeddings are effective at classifying the nodes in a graph, such as user behavior prediction in a social network (Perozzi et al., 2014; Tang et al., 2015). However, the graph embeddings are usually learned separately from the supervised task, and hence do not leverage the label information of a specific task. Graph embeddings are therefore in some sense complementary to graph Laplacian regularization, which does not produce useful features itself and might not be able to fully leverage the distributional information encoded in the graph structure.

The main highlight of our work is to incorporate embedding techniques into the graph-based semi-supervised learning setting. We propose a novel graph-based semi-supervised learning framework, Planetoid (Predicting Labels And Neighbors with Embeddings Transductively Or Inductively from Data). The embedding of an instance is jointly trained to predict the class label of the instance and the context in the graph. We then concatenate the embeddings and the hidden layers of the original classifier and feed them to a softmax layer when making the prediction. Since the embeddings are learned based on the graph structure, this method is transductive, which means we can only make predictions for instances that are already observed in the graph at training time. In many cases, however, it may be desirable to have an inductive approach, where predictions can be made on instances unobserved in the graph seen at training time. To address this issue, we further develop an inductive variant of our framework, where we define the embeddings as a parameterized function of the input feature vectors; i.e., the embeddings can be viewed as hidden layers of a neural network.

To demonstrate the effectiveness of our proposed approach, we conducted experiments on five datasets for three tasks: text classification, distantly supervised entity extraction, and entity classification. Our inductive method outperforms the second best inductive method by up to 18.7%¹ and on average 7.8% in terms of accuracy. The best of our inductive and transductive methods outperforms the best of all the other compared methods by up to 8.5% and on average 4.1%.

¹ % refers to absolute percentage points throughout the paper.
2. Related Work

2.1. Semi-Supervised Learning

Let L and U be the number of labeled and unlabeled instances, and let x_{1:L} and x_{L+1:L+U} denote the feature vectors of the labeled and unlabeled instances respectively. The labels y_{1:L} are also given. Based on both labeled and unlabeled instances, the problem of semi-supervised learning is defined as learning a classifier f : x → y. There are two learning paradigms, transductive learning and inductive learning. Transductive learning (Zhu et al., 2003; Zhou et al., 2004) only aims to apply the classifier f to the unlabeled instances observed at training time, and the classifier does not generalize to unobserved instances. For instance, the transductive support vector machine (TSVM) (Joachims, 1999) maximizes the "unlabeled data margin" based on the low-density separation assumption that a good decision hyperplane lies in a sparse area of the feature space. Inductive learning (Belkin et al., 2006; Weston et al., 2012), on the other hand, aims to learn a parameterized classifier f that generalizes to unobserved instances.

2.2. Graph-Based Semi-Supervised Learning

In addition to labeled and unlabeled instances, a graph, denoted as an (L + U) × (L + U) matrix A, is also given to graph-based semi-supervised learning methods. Each entry a_ij indicates the similarity between instances i and j, which can be either labeled or unlabeled. The graph A can either be derived from distances between instances (Zhu et al., 2003), or be explicitly derived from external data, such as a knowledge graph (Wijaya et al., 2013) or a citation network between documents (Ji et al., 2010). In this paper, we mainly focus on the setting where a graph is explicitly given and represents additional information not present in the feature vectors (e.g., the graph edges correspond to hyperlinks between documents, rather than distances between the bag-of-words representations of documents).

Graph-based semi-supervised learning is based on the assumption that nearby nodes tend to have the same labels. Generally, the loss function of graph-based semi-supervised learning in the binary case can be written as

$$\sum_{i=1}^{L} l(y_i, f(x_i)) + \lambda \sum_{i,j} a_{ij} \| f(x_i) - f(x_j) \|^2 = \sum_{i=1}^{L} l(y_i, f(x_i)) + \lambda f^{\top} \Delta f \qquad (1)$$

In Eq. (1), the first term is the standard supervised loss function, where l(·, ·) can be log loss, squared loss or hinge loss. The second term is the graph Laplacian regularization, which incurs a large penalty when similar nodes with a large a_ij are predicted to have different labels f(x_i) ≠ f(x_j). The graph Laplacian matrix Δ is defined as Δ = D − A, where D is a diagonal matrix with entries d_ii = Σ_j a_ij, and λ is a constant weighting factor. (Note that we omit the parameter regularization terms for simplicity.)
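To make the regularizer concrete, here is a minimal numpy sketch (toy data and a squared-loss choice of l of our own, not the paper's code) that builds the Laplacian Δ = D − A and evaluates the objective of Eq. (1); the last line checks that the quadratic form f^T Δ f coincides with the weighted pairwise disagreement penalty (written here with a factor of ½ over ordered pairs).

```python
import numpy as np

# Toy graph over four instances; the first L = 2 are labeled.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 0., 1.],
              [1., 0., 0., 1.],
              [0., 1., 1., 0.]])        # symmetric similarities a_ij
y = np.array([1.0, -1.0])               # labels of the L labeled instances
f = np.array([0.9, -0.8, 0.7, -0.6])    # current predictions f(x_i) for all nodes
lam = 0.5                               # weighting factor lambda

D = np.diag(A.sum(axis=1))              # degree matrix, d_ii = sum_j a_ij
Delta = D - A                           # graph Laplacian

supervised = np.sum((y - f[:2]) ** 2)   # sum_i l(y_i, f(x_i)) with squared loss
laplacian = f @ Delta @ f               # f^T Delta f, the regularizer in Eq. (1)
pairwise = 0.5 * np.sum(A * (f[:, None] - f[None, :]) ** 2)  # same penalty, pairwise form

print(supervised + lam * laplacian)     # full objective
print(laplacian, pairwise)              # the two forms of the regularizer agree
```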
Various graph-based semi-supervised learning algorithms define their loss functions as variants of Eq. (1). Label propagation (Zhu et al., 2003) forces f to agree with the labeled instances y_{1:L}; f is a label lookup table for the unlabeled instances in the graph, and can be obtained with a closed-form solution. Learning with local and global consistency (Zhou et al., 2004) defines l as squared loss and f as a label lookup table; it does not force f to agree with the labeled instances. Modified Adsorption (MAD) (Talukdar & Crammer, 2009) is a variant of label propagation that allows the predictions on labeled instances to vary and incorporates node uncertainty. Manifold regularization (Belkin et al., 2006) parameterizes f in a Reproducing Kernel Hilbert Space (RKHS) with l being squared loss or hinge loss. Since f is a parameterized classifier, manifold regularization is inductive and can naturally handle unobserved instances.

Semi-supervised embedding (Weston et al., 2012) extends the regularization term in Eq. (1) to Σ_{i,j} a_ij ||g(x_i) − g(x_j)||², where g represents the embeddings of instances, which can be the output labels, hidden layers or auxiliary embeddings in a neural network. By extending the regularization from f to g, this method imposes stronger constraints on a neural network. The iterative classification algorithm (ICA) (Sen et al., 2008) uses a local classifier that takes the labels of neighbor nodes as input, and employs an iterative process that alternates between estimating the local classifier and assigning new labels.

2.3. Learning Embeddings

Extensive research has been done on learning graph embeddings. A probabilistic generative model was proposed to learn node embeddings that generate the edges in a graph (Snijders & Nowicki, 1997). A clustering method (Handcock et al., 2007) was proposed to learn latent social states in a social network to predict social ties.
More recently, a number of embedding learning methods have been based on the Skipgram model, a variant of the softmax model. Given an instance and its context, the objective of Skipgram is usually formulated as minimizing the log loss of predicting the context using the embedding of the instance as input features. Formally, let {(i, c)} be a set of pairs of instance i and context c; the loss function can be written as

$$-\sum_{(i,c)} \log p(c|i) = -\sum_{(i,c)} \Big( w_c^{\top} e_i - \log \sum_{c' \in \mathcal{C}} \exp(w_{c'}^{\top} e_i) \Big) \qquad (2)$$

where C is the set of all possible contexts, the w's are parameters of the Skipgram model, and e_i is the embedding of instance i.
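The full-softmax loss in Eq. (2) can be evaluated directly when the context space is small. The sketch below is illustrative only (the array shapes and names are our own, not the paper's implementation); it computes −Σ log p(c|i) with a log-sum-exp normalization over all contexts.

```python
import numpy as np

def skipgram_loss(pairs, E, W):
    """Full-softmax Skipgram loss of Eq. (2), i.e. -sum log p(c | i).

    pairs: list of (instance, context) index pairs
    E:     instance embeddings e_i, shape (num_instances, dim)
    W:     context parameters w_c,  shape (num_contexts, dim)
    """
    loss = 0.0
    for i, c in pairs:
        scores = W @ E[i]                              # w_{c'}^T e_i for every context c'
        log_norm = np.log(np.exp(scores - scores.max()).sum()) + scores.max()
        loss -= scores[c] - log_norm                   # -log softmax_c(scores)
    return loss

rng = np.random.default_rng(0)
E, W = rng.normal(size=(5, 8)), rng.normal(size=(6, 8))
print(skipgram_loss([(0, 2), (1, 5)], E, W))
```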
Skipgram was first introduced to learn representations of words, known as word2vec (Mikolov et al., 2013). In word2vec, for each training pair (i, c), the instance i is the current word whose embedding is under estimation; the context c is each of the surrounding words of i within a fixed window size in a sentence; the context space C is the vocabulary of the corpus. Skipgram was later extended to learn graph embeddings. Deepwalk (Perozzi et al., 2014) uses the embedding of a node to predict the context in the graph, where the context is generated by random walks. More specifically, for each training pair (i, c), the instance i is the current node whose embedding is under estimation; the context c is each of the neighbor nodes within a fixed window size in a generated random walk sequence; the context space C is all the nodes in the graph. LINE (Tang et al., 2015) extends the model to have multiple context spaces C for modeling both first- and second-order proximity.

Although Skipgram-like models for graphs have received much recent attention, many other models exist. TransE (Bordes et al., 2013) learns the embeddings of entities in a knowledge graph jointly with their relations. Autoencoders were used to learn graph embeddings for clustering on graphs (Tian et al., 2014).

2.4. Comparison

We compare our approach with other methods in semi-supervised learning and embedding learning in Table 1. Unlike our approach, conventional graph Laplacian based methods (Zhu et al., 2003; Belkin et al., 2006; Talukdar & Crammer, 2009) impose regularization on the labels but do not learn embeddings. The semi-supervised embedding method (Weston et al., 2012) learns embeddings in a neural network, but differs from our approach in that it imposes regularization on the embeddings, whereas we use the embeddings to predict the context in the graph. Graph embedding methods (Perozzi et al., 2014; Tian et al., 2014) encode the graph structure into embeddings; however, different from our approach, these methods are purely unsupervised and do not leverage label information for a specific task. Moreover, these methods are transductive and cannot be directly generalized to instances unseen at training time.

Table 1. Comparison of various semi-supervised learning algorithms and graph embedding algorithms. √ means using the given formulation or information; × means not available or not using the information. In the column Graph, "regularization" means imposing regularization with the graph structure; "features" means using the graph structure as features; "context" means predicting the graph context.

Method                               | Features | Labels | Paradigm     | Embeddings | Graph
TSVM (Joachims, 1999)                | √        | √      | Transductive | ×          | ×
Label propagation (Zhu et al., 2003) | ×        | √      | Transductive | ×          | Regularization
Manifold Reg (Belkin et al., 2006)   | √        | √      | Inductive    | ×          | Regularization
ICA (Sen et al., 2008)               | ×        | √      | Transductive | ×          | Features
MAD (Talukdar & Crammer, 2009)       | ×        | √      | Transductive | ×          | Regularization
Semi Emb (Weston et al., 2012)       | √        | √      | Inductive    | √          | Regularization
Graph Emb (Perozzi et al., 2014)     | ×        | ×      | Transductive | √          | Context
Planetoid (this paper)               | √        | √      | Both         | √          | Context

3. Semi-Supervised Learning with Graph Embeddings

Following the notation of the previous section, the input to our method includes labeled instances x_{1:L}, y_{1:L}, unlabeled instances x_{L+1:L+U}, and a graph denoted as a matrix A. Each instance i has an embedding denoted as e_i.

We formulate our framework based on feed-forward neural networks. Given the input feature vector x, the k-th hidden layer of the network is denoted as h^k, a nonlinear function of the previous hidden layer h^{k−1} defined as h^k(x) = ReLU(W^k h^{k−1}(x) + b^k), where W^k and b^k are the parameters of the k-th layer and h^0(x) = x. We adopt the rectified linear unit ReLU(x) = max(0, x) as the nonlinear function in this work.

The loss function of our framework can be expressed as

    L_s + λ L_u,

where L_s is a supervised loss of predicting the labels, and L_u is an unsupervised loss of predicting the graph context. In the following sections, we first formulate L_u by introducing how to sample context from the graph, and then formulate L_s to form our semi-supervised learning framework.
3.1. Sampling Context

We formulate the unsupervised loss L_u as a variant of Eq. (2). Given a graph A, the basic idea of our approach is to sample pairs of instance i and context c, and then formulate the loss L_u using the log loss −log p(c|i) as in Eq. (2). We first present the formulation of L_u by introducing negative sampling, and then discuss how to sample pairs of instance and context.

It is usually intractable to directly optimize Eq. (2) due to the normalization over the whole context space C. Negative sampling was introduced to address this issue (Mikolov et al., 2013); it samples negative examples to approximate the normalization term. In our case, we sample (i, c, γ) from a distribution, where i and c denote an instance and a context respectively, γ = +1 means (i, c) is a positive pair, and γ = −1 means it is a negative pair. Given (i, c, γ), we minimize the cross entropy loss of classifying the pair (i, c) to the binary label γ:

$$-\mathbb{I}(\gamma = 1) \log \sigma(w_c^{\top} e_i) - \mathbb{I}(\gamma = -1) \log \sigma(-w_c^{\top} e_i),$$

where σ is the sigmoid function defined as σ(x) = 1/(1 + e^{−x}), and I(·) is an indicator function that outputs 1 when the argument is true and 0 otherwise. Therefore, the unsupervised loss with negative sampling can be written as

$$\mathcal{L}_u = -\mathbb{E}_{(i,c,\gamma)} \log \sigma(\gamma w_c^{\top} e_i) \qquad (3)$$

The distribution p(i, c, γ) is conditioned on the labels y_{1:L} and the graph A. However, since these are inputs to our algorithm and kept fixed, we drop the conditioning in our notation.
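A minimal sketch of the Monte-Carlo estimate of Eq. (3), given a list of sampled (i, c, γ) triplets (the array names are hypothetical; this is not the paper's implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(samples, E, W):
    """Monte-Carlo estimate of L_u = -E[log sigma(gamma * w_c^T e_i)] from Eq. (3).

    samples: list of (i, c, gamma) triplets with gamma in {+1, -1}
    E, W:    instance embeddings and context parameters, shape (n, dim) each
    """
    return float(np.mean([-np.log(sigmoid(g * W[c] @ E[i])) for i, c, g in samples]))

rng = np.random.default_rng(1)
E, W = rng.normal(size=(10, 16)), rng.normal(size=(10, 16))
print(negative_sampling_loss([(0, 3, +1), (0, 7, -1), (4, 2, +1)], E, W))
```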
We now define the distribution p(i, c, γ) directly via a sampling process, which is illustrated in Algorithm 1. Two types of context are sampled in this algorithm. The first type of context is based on the graph A, which encodes the structural (distributional) information, and the second type of context is based on the labels, which we use to inject label information into the embeddings. We use a parameter r1 ∈ (0, 1) to control the ratio of positive to negative samples, and a parameter r2 ∈ (0, 1) to control the ratio of the two types of context.

With probability r2, we sample the context based on the graph A. We first uniformly sample a random walk sequence S. More specifically, we uniformly sample the first instance S_1 from the set 1 : L + U. Given the previous instance S_{k−1} = i, the next instance S_k = j is sampled with probability a_ij / Σ_{j'=1}^{L+U} a_{ij'}. With probability r1, we sample a positive pair (i, c) from the set {(S_j, S_k) : |j − k| < d}, where d is another parameter determining the window size. With probability (1 − r1), we uniformly corrupt the context c to sample a negative pair.

With probability (1 − r2), we sample the context based on the class labels: positive pairs have the same labels and negative pairs have different labels. Only labeled instances 1 : L are sampled.

Our random walk based sampling method is built upon Deepwalk (Perozzi et al., 2014). In contrast to their method, our method handles real-valued A, incorporates negative sampling, and explicitly samples from the labels with probability (1 − r2) to inject supervised information. An example of sampling when γ = 1 is shown in Figure 1.

[Figure 1. An example of sampling from the context distribution p(i, c, γ) when γ = 1 and d = 2. In circles, +1 denotes positive instances, −1 denotes negative instances, and ? denotes unlabeled instances. If random < r2, we first sample a random walk 2 → 1 → 4 → 6, and then sample two nodes in the random walk within distance d. If random ≥ r2, we sample two instances with the same labels.]
Algorithm 1 Sampling Context Distribution p(i, c, γ)

Input: graph A, labels y_{1:L}, parameters r1, r2, q, d
Initialize triplet (i, c, γ)
if random < r1 then γ ← +1 else γ ← −1
if random < r2 then
    Uniformly sample a random walk S of length q
    Uniformly sample (S_j, S_k) with |j − k| < d
    i ← S_j, c ← S_k
    if γ = −1 then uniformly sample c from 1 : L + U
else
    if γ = +1 then
        Uniformly sample (i, c) with y_i = y_c
    else
        Uniformly sample (i, c) with y_i ≠ y_c
    end if
end if
return (i, c, γ)
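The following Python sketch mirrors Algorithm 1 under simplifying assumptions of our own (a dense nonnegative A, and a uniform choice of walk position j followed by a choice of k with |j − k| < d); it illustrates the control flow rather than reproducing the authors' implementation.

```python
import numpy as np

def sample_context(A, labels, r1, r2, q, d, rng):
    """One draw (i, c, gamma) in the spirit of Algorithm 1 (illustrative sketch).

    A:      dense (L+U) x (L+U) nonnegative similarity matrix
    labels: length-L array of class labels for the labeled instances
    """
    n, L = A.shape[0], len(labels)
    gamma = 1 if rng.random() < r1 else -1
    if rng.random() < r2:
        # Graph-based context: random walk of length q, transition prob. a_ij / sum_j' a_ij'.
        walk = [rng.integers(n)]
        for _ in range(q - 1):
            p = A[walk[-1]]
            walk.append(rng.choice(n, p=p / p.sum()))
        j = rng.integers(q)
        k = rng.integers(max(0, j - d + 1), min(q, j + d))   # enforces |j - k| < d
        i, c = walk[j], walk[k]
        if gamma == -1:
            c = rng.integers(n)            # corrupt the context to form a negative pair
    else:
        # Label-based context: only labeled instances are sampled.
        i = rng.integers(L)
        pool = np.flatnonzero(labels == labels[i]) if gamma == 1 \
               else np.flatnonzero(labels != labels[i])
        c = rng.choice(pool)
    return int(i), int(c), gamma

rng = np.random.default_rng(2)
A = rng.random((8, 8)); A = (A + A.T) / 2; np.fill_diagonal(A, 0)
print(sample_context(A, np.array([0, 0, 1, 1]), r1=5/6, r2=0.5, q=10, d=3, rng=rng))
```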

3.2. Transductive Formulation

In this section, we present a method that infers the labels of the unlabeled instances y_{L+1:L+U} without generalizing to unobserved instances. Transductive learning usually performs better than inductive learning because it can leverage the unlabeled test data when training the model (Joachims, 1999).

We apply k layers on the input feature vector x to obtain h^k(x), and l layers on the embedding e to obtain h^l(e), as illustrated in Figure 2(a). The two hidden layers are concatenated and fed to a softmax layer to predict the class label of the instance. More specifically, the probability of predicting label y is

$$p(y|x, e) = \frac{\exp\big[h^k(x)^{\top}, h^l(e)^{\top}\big] w_y}{\sum_{y'} \exp\big[h^k(x)^{\top}, h^l(e)^{\top}\big] w_{y'}} \qquad (4)$$

where [·, ·] denotes the concatenation of two row vectors, the superscript T denotes the transpose of a vector, and the w's are model parameters.

[Figure 2. Network architecture: transductive vs. inductive. (a) Transductive formulation; (b) inductive formulation. Each dotted arrow represents a feed-forward network with an arbitrary number of layers (we use only one layer in our experiments). Solid arrows denote direct connections.]

Combined with Eq. (3), the loss function of transductive learning is defined as

$$-\frac{1}{L}\sum_{i=1}^{L} \log p(y_i|x_i, e_i) - \lambda \, \mathbb{E}_{(i,c,\gamma)} \log \sigma(\gamma w_c^{\top} e_i),$$

where the first term is defined by Eq. (4), and λ is a constant weighting factor. The first term is the loss function of class label prediction and the second term is the loss function of context prediction. This formulation is transductive because the prediction of the label y depends on the embedding e, which can only be learned for instances observed in the graph A at training time.
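A schematic numpy version of the transductive predictor in Eq. (4), with one ReLU layer on the features and one on the embedding (the weight names and sizes below are hypothetical, not taken from the paper's code):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def transductive_predict(x, e, params):
    """p(y | x, e) as in Eq. (4): one ReLU layer on x, one on e, concatenate, softmax."""
    hx = relu(params["Wx"] @ x + params["bx"])     # h^k(x) with k = 1
    he = relu(params["We"] @ e + params["be"])     # h^l(e) with l = 1
    h = np.concatenate([hx, he])                   # [h^k(x)^T, h^l(e)^T]
    return softmax(params["Wy"] @ h)               # softmax over the class weights w_y

rng = np.random.default_rng(3)
dim_x, dim_e, hid, classes = 20, 16, 8, 3
params = {"Wx": rng.normal(size=(hid, dim_x)), "bx": np.zeros(hid),
          "We": rng.normal(size=(hid, dim_e)), "be": np.zeros(hid),
          "Wy": rng.normal(size=(classes, 2 * hid))}
print(transductive_predict(rng.normal(size=dim_x), rng.normal(size=dim_e), params))
```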
3.3. Inductive Formulation

While we consider transductive learning in the above formulation, in many cases it is desirable to learn a classifier that can generalize to unobserved instances, especially for large-scale tasks. For example, machine reading systems (Carlson et al., 2010) very frequently encounter novel entities on the Web, and it is not practical to train a semi-supervised learning system on the entire Web. However, since learning graph embeddings is transductive in nature, it is not straightforward to do so in an inductive setting. Perozzi et al. (2014) addressed this issue by retraining the embeddings incrementally, which is time consuming, does not scale, and is not truly inductive.

To make the method inductive, the prediction of the label y should depend only on the input feature vector x. Therefore, we define the embedding e as a parameterized function of the feature x, as shown in Figure 2(b). Similar to the transductive formulation, we apply k layers on the input feature vector x to obtain h^k(x). However, rather than using a "free" embedding, we apply l_1 layers on the input feature vector x and define the result as the embedding e = h^{l_1}(x). Then another l_2 layers are applied on the embedding, h^{l_2}(e) = h^{l_2}(h^{l_1}(x)), denoted as h^l(x) where l = l_1 + l_2. The embedding e in this formulation can be viewed as a hidden layer that is a parameterized function of the feature x.

With the above formulation, the label y depends only on the feature x. More specifically,

$$p(y|x) = \frac{\exp\big[h^k(x)^{\top}, h^l(x)^{\top}\big] w_y}{\sum_{y'} \exp\big[h^k(x)^{\top}, h^l(x)^{\top}\big] w_{y'}} \qquad (5)$$

Replacing e_i in Eq. (3) with h^{l_1}(x_i), the loss function of inductive learning is

$$-\frac{1}{L}\sum_{i=1}^{L} \log p(y_i|x_i) - \lambda \, \mathbb{E}_{(i,c,\gamma)} \log \sigma\big(\gamma w_c^{\top} h^{l_1}(x_i)\big),$$

where the first term is defined by Eq. (5).
ozzi et al. (2014) addressed this issue by retraining the em- where the first term is defined by Eq. (5).
3.4. Training

We adopt stochastic gradient descent (SGD) (Bottou, 2010) to train our model in mini-batch mode. We first sample a batch of labeled instances and take a gradient step to optimize the loss function of class label prediction. We then sample a batch of context (i, c, γ) and take another gradient step to optimize the loss function of context prediction. We repeat the above procedures for T1 and T2 iterations respectively to approximate the weighting factor λ. Algorithm 2 illustrates the SGD-based training algorithm for the transductive formulation. Similarly, we can replace p(y_i|x_i, e_i) with p(y_i|x_i) in L_s to obtain the training algorithm for the inductive formulation. Let θ denote all model parameters. We update both the embeddings e and the parameters θ in transductive learning, and update only the parameters θ in inductive learning. Before the joint training procedure, we apply a number of training iterations that optimize the unsupervised loss L_u alone and use the learned embeddings e as initialization for joint training.

Algorithm 2 Model Training (Transductive)

Input: A, x_{1:L+U}, y_{1:L}, λ, batch iterations T1, T2 and sizes N1, N2
repeat
    for t ← 1 to T1 do
        Sample a batch of labeled instances i of size N1
        L_s = −(1/N1) Σ_i log p(y_i | x_i, e_i)
        Take a gradient step for L_s
    end for
    for t ← 1 to T2 do
        Sample a batch of context from p(i, c, γ) of size N2
        L_u = −(1/N2) Σ_{(i,c,γ)} log σ(γ w_c^T e_i)
        Take a gradient step for L_u
    end for
until stopping
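The alternating schedule of Algorithm 2 can be sketched as follows. For brevity, this toy version uses a plain linear softmax for the label loss L_s rather than the concatenated hidden layers of Eq. (4), and free embeddings E and context parameters W for L_u; it is a schematic rendering under those simplifying assumptions, not the released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_alternating(X, y, num_nodes, sample_context, T1=10, T2=10,
                      N1=20, N2=20, dim=16, lr=0.1, rounds=20, seed=0):
    """Toy version of the alternating SGD schedule in Algorithm 2.

    X: (L, F) features of the labeled instances; y: (L,) integer labels in 0..K-1.
    sample_context: callable returning one (i, c, gamma) triplet (e.g. Algorithm 1),
    with indices smaller than num_nodes.
    """
    rng = np.random.default_rng(seed)
    L, F = X.shape
    K = int(y.max()) + 1
    Wy = np.zeros((K, F))                          # label-prediction parameters
    E = 0.01 * rng.normal(size=(num_nodes, dim))   # instance embeddings e_i
    W = 0.01 * rng.normal(size=(num_nodes, dim))   # context parameters w_c
    for _ in range(rounds):                        # "repeat ... until stopping"
        for _ in range(T1):                        # T1 steps on the supervised loss L_s
            idx = rng.choice(L, size=N1)
            logits = X[idx] @ Wy.T
            p = np.exp(logits - logits.max(axis=1, keepdims=True))
            p /= p.sum(axis=1, keepdims=True)
            p[np.arange(N1), y[idx]] -= 1.0        # dL_s/dlogits for cross entropy
            Wy -= lr * p.T @ X[idx] / N1
        for _ in range(T2):                        # T2 steps on the context loss L_u
            for _ in range(N2):
                i, c, g = sample_context()
                e_i, w_c = E[i].copy(), W[c].copy()
                coeff = lr * g * (1.0 - sigmoid(g * w_c @ e_i))
                E[i] += coeff * w_c                # ascend log sigma(g w_c^T e_i)
                W[c] += coeff * e_i
    return Wy, E, W
```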
4. Experiments

In our experiments, Planetoid-T and Planetoid-I denote the transductive and inductive formulations of our approach. We compare our approach with label propagation (LP) (Zhu et al., 2003), semi-supervised embedding (SemiEmb) (Weston et al., 2012), manifold regularization (ManiReg) (Belkin et al., 2006), TSVM (Joachims, 1999), and graph embeddings (GraphEmb) (Perozzi et al., 2014). Another baseline method, denoted as Feat, is a linear softmax model that takes only the feature vectors x as input. We also derive a variant, Planetoid-G, that learns embeddings to jointly predict class labels and graph context without using the feature vectors. The architecture of Planetoid-G is similar to Figure 2(a) except that the input features and the corresponding hidden layers are removed. Among the above methods, LP, GraphEmb and Planetoid-G do not use the features x, while TSVM and Feat do not use the graph A. We include these methods in our experimental settings to better evaluate our approach. Our preliminary experiments on the text classification datasets show that the performance of our model is not very sensitive to specific choices of the network architecture². We adapt the implementation of GraphEmb³ to our Skipgram implementation. We use the Junto library (Talukdar & Crammer, 2009) for label propagation, and SVMLight⁴ for TSVM. We also use our own implementation of ManiReg and SemiEmb, obtained by modifying the symbolic objective function in Planetoid. In all of our experiments, we set the model hyper-parameters to r1 = 5/6, q = 10, d = 3, N1 = 200 and N2 = 200 for Planetoid. We use the same r1, q and d for GraphEmb, and the same N1 and N2 for ManiReg and SemiEmb. We tune r2, T1, T2, the learning rate, and the hyper-parameters of the other models on an additional data split with a different random seed.

² We note that it is possible to develop other architectures for different applications, such as using a shared hidden layer for feature vectors and embeddings.
³ https://github.com/phanein/deepwalk
⁴ http://svmlight.joachims.org/

The statistics of our five benchmark datasets are reported in Table 2. For each dataset, we split all instances into three parts: labeled data, unlabeled data, and test data. Inductive methods are trained on the labeled and unlabeled data, and tested on the test data. Transductive methods, on the other hand, are trained on the labeled data, unlabeled data, and test data without labels.

Table 2. Dataset statistics.

Dataset  | #Classes | #Nodes    | #Edges
Citeseer | 6        | 3,327     | 4,732
Cora     | 7        | 2,708     | 5,429
Pubmed   | 3        | 19,717    | 44,338
DIEL     | 4        | 4,373,008 | 4,464,261
NELL     | 210      | 65,755    | 266,144
4.1. Text Classification

We first considered three text classification datasets⁵: Citeseer, Cora and Pubmed (Sen et al., 2008). Each dataset contains bag-of-words representations of documents and citation links between the documents. We treat the bag-of-words as feature vectors x. We construct the graph A based on the citation links: if document i cites j, we set a_ij = a_ji = 1. The goal is to classify each document into one class. We randomly sample 20 instances per class as labeled data, 1,000 instances as test data, and use the rest as unlabeled data. The same data splits are used for all methods, and we compute the average accuracy for comparison.

⁵ http://linqs.umiacs.umd.edu/projects//projects/lbc/
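Constructing A from citation links amounts to symmetrizing the citation relation; a small sketch (0-based document indices assumed):

```python
import numpy as np

def citation_graph(num_docs, citations):
    """Symmetric adjacency used in Section 4.1: a_ij = a_ji = 1 if document i cites j.

    citations: iterable of (i, j) pairs with 0-based document indices.
    """
    A = np.zeros((num_docs, num_docs))
    for i, j in citations:
        A[i, j] = A[j, i] = 1.0
    return A

print(citation_graph(4, [(0, 1), (2, 3), (0, 3)]))
```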
Table 3. Accuracy on text classification. Upper rows are inductive methods and lower rows are transductive methods.

Method      | Citeseer | Cora  | Pubmed
Feat        | 0.572    | 0.574 | 0.698
ManiReg     | 0.601    | 0.595 | 0.707
SemiEmb     | 0.596    | 0.590 | 0.711
Planetoid-I | 0.647    | 0.612 | 0.772
TSVM        | 0.640    | 0.575 | 0.622
LP          | 0.453    | 0.680 | 0.630
GraphEmb    | 0.432    | 0.672 | 0.653
Planetoid-G | 0.493    | 0.691 | 0.664
Planetoid-T | 0.629    | 0.757 | 0.757

The experimental results are reported in Table 3. Among the inductive methods, Planetoid-I achieves the best performance on all three datasets, with an improvement of up to 6.1% on Pubmed, which indicates that our embedding techniques are more effective than graph Laplacian regularization. Among the transductive methods, Planetoid-T achieves the best performance on Cora and Pubmed, while TSVM performs best on Citeseer; however, TSVM does not perform well on Cora and Pubmed. Planetoid-I slightly outperforms Planetoid-T on Citeseer and Pubmed, while Planetoid-T obtains up to a 14.5% improvement over Planetoid-I on Cora. We conjecture that in Planetoid-I the feature vectors impose constraints on the learned embeddings, since the embeddings are represented by a parameterized function of the input feature vectors. If such constraints are appropriate, as is the case on Citeseer and Pubmed, they improve the non-convex optimization of embedding learning and lead to better performance. However, if such constraints rule out the optimal embeddings, the inductive model will suffer.

Planetoid-G consistently outperforms GraphEmb on all three datasets, which indicates that joint training with label information can improve performance over training the supervised and unsupervised objectives separately. Figure 3 displays the 2-D embedding spaces on the Cora dataset obtained with t-SNE (Van der Maaten & Hinton, 2008). Note that different classes are better separated in the embedding space of Planetoid-T than in those of GraphEmb and SemiEmb, which is consistent with our empirical findings. We observe similar results for the other two datasets.

[Figure 3. t-SNE visualization of the embedding spaces on the Cora dataset: (a) GraphEmb, (b) Planetoid-T, (c) SemiEmb. Each color denotes a class.]
4.2. Distantly-Supervised Entity Extraction

We next considered the DIEL (Distant Information Extraction using coordinate-term Lists) dataset (Bing et al., 2015). The DIEL dataset contains pre-extracted features for each entity mention in text, and a graph that connects entity mentions to coordinate lists. The goal is to extract medical entities from text given the feature vectors and the graph.

We follow the exact experimental setup of the original DIEL paper (Bing et al., 2015), including the data splits of different runs, the preprocessing of entity mentions and coordinate lists, and the evaluation. We treat the top-k entities returned by a model as positive instances and compute recall@k for evaluation (k is set to 240,000 following the DIEL paper). We report the average result of 10 runs in Table 4, where Feat refers to a result obtained by an SVM (referred to as DS-Baseline in the DIEL paper). The result of LP was also taken from (Bing et al., 2015). DIEL in Table 4 refers to the method proposed in the original paper, which is an improved version of label propagation that trains classifiers on feature vectors based on the output of label propagation. We did not include TSVM in the comparison since it does not scale. Since we use Freebase as ground truth and some entities are not present in text, the upper bound of recall shown in Table 4 is 0.617.

Table 4. Recall@k on DIEL distantly-supervised entity extraction. Upper rows are inductive methods and lower rows are transductive methods. Results marked with * are taken from the original DIEL paper (Bing et al., 2015) with the same data splits.

Method      | Recall@k
Feat        | 0.349
ManiReg     | 0.477
SemiEmb     | 0.486
Planetoid-I | 0.501
DIEL*       | 0.405
LP          | 0.162
GraphEmb    | 0.258
Planetoid-G | 0.394
Planetoid-T | 0.500
Upper Bound | 0.617

Both Planetoid-I and Planetoid-T significantly outperform all other methods. Each of Planetoid-I and Planetoid-T achieves the best performance in 5 out of 10 runs, and they give similar recall on average, which indicates that there is no significant difference between the two methods on this dataset. Planetoid-G clearly outperforms GraphEmb, which again shows the benefit of joint training.
4.3. Entity Classification

We constructed an entity classification dataset from the knowledge base of Never Ending Language Learning (NELL) (Carlson et al., 2010) and a hierarchical entity classification dataset (Dalvi & Cohen, 2016) that links NELL entities to text in ClueWeb09. We extracted the entities and the relations between entities from the NELL knowledge base, and then obtained text descriptions by linking the entities to ClueWeb09. We use the text bag-of-words representation as the feature vectors of the entities.

We next describe how to construct the graph based on the knowledge base. We first remove relations that are not populated in NELL, including "generalizations", "haswikipediaurl", and "atdate". In the knowledge base, each relation is denoted as a triplet (e1, r, e2), where e1, r, e2 denote the head entity, the relation, and the tail entity respectively. We treat each entity e as a node in the graph, and each relation r is split into two nodes r1 and r2 in the graph. For each (e1, r, e2), we add two edges to the graph, (e1, r1) and (e2, r2).
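A sketch of the graph construction just described; the relation-node naming (r, 1)/(r, 2) is our own shorthand for the two nodes r1 and r2, and the toy triplets are illustrative only.

```python
def kb_graph(triplets, skip=("generalizations", "haswikipediaurl", "atdate")):
    """Edges of the Section 4.3 graph from (e1, r, e2) triplets (toy sketch).

    Each relation r contributes two relation nodes, written here as (r, 1) and (r, 2);
    each triplet adds the edges (e1, (r, 1)) and (e2, (r, 2)).
    """
    edges = set()
    for e1, r, e2 in triplets:
        if r in skip:                  # relations removed in the paper
            continue
        edges.add((e1, (r, 1)))
        edges.add((e2, (r, 2)))
    return edges

print(kb_graph([("pittsburgh", "citylocatedinstate", "pennsylvania"),
                ("cmu", "generalizations", "university")]))
```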
We removed all classes with fewer than 10 entities. The goal is to classify the entities in the knowledge base into one of the 210 classes given the feature vectors and the graph. Let β be the labeling rate. We set β to 0.1, 0.01, and 0.001. For a class with N entities, max(βN, 1) instances are labeled, so each class has at least one entity in the labeled data.

Table 5. Accuracy on NELL entity classification with labeling rates of 0.1, 0.01, and 0.001. Upper rows are inductive methods and lower rows are transductive methods.

Method        | 0.1   | 0.01  | 0.001
Feat          | 0.621 | 0.404 | 0.217
ManiReg       | 0.634 | 0.413 | 0.218
SemiEmb       | 0.654 | 0.438 | 0.267
Planetoid-I   | 0.702 | 0.598 | 0.454
LP            | 0.714 | 0.448 | 0.265
GraphEmb      | 0.795 | 0.725 | 0.581
Planetoid-G/T | 0.845 | 0.757 | 0.619

We report the results in Table 5. We did not include TSVM since it does not scale to such a large number of classes with the one-vs-rest scheme. Adding feature vectors does not improve the performance of Planetoid-T, so we set the feature vectors for Planetoid-T to be all empty; Planetoid-T is therefore equivalent to Planetoid-G in this case.

Planetoid-I significantly outperforms the best of the other inductive methods, i.e., SemiEmb, by 4.8%, 16.0%, and 18.7% at the three labeling rates respectively. As the labeling rate decreases, the improvement of Planetoid-I over SemiEmb becomes more significant.

Graph structure is more informative than features in this dataset, so inductive methods perform worse than transductive methods. Planetoid-G outperforms GraphEmb by 5.0%, 3.2% and 3.8%.

5. Conclusion

Our contribution is three-fold: a) in contrast to previous semi-supervised learning approaches that largely depend on graph Laplacian regularization, we propose a novel approach based on joint training of classification and graph context prediction; b) since it is difficult to generalize graph embeddings to novel instances, we design a novel inductive approach that conditions the embeddings on input features; c) we empirically show substantial improvement over existing methods (up to 8.5% and on average 4.1%), and an even more significant improvement in the inductive setting (up to 18.7% and on average 7.8%).

Our experimental results on five benchmark datasets also show that a) joint training gives improvement over unsupervised learning; b) predicting graph context is more effective than graph Laplacian regularization; c) the performance of the inductive variant depends on the informativeness of the feature vectors.

One direction of future work is to apply our framework to more complex networks, including recurrent networks. It would also be interesting to experiment with datasets where the graph is computed based on distances between feature vectors.

Acknowledgements

This work was funded by the NSF under grants CCF-1414030 and IIS-1250956, and by Google.

References

Belkin, Mikhail, Niyogi, Partha, and Sindhwani, Vikas. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. JMLR, 7:2399-2434, 2006.

Bing, Lidong, Chaudhari, Sneha, Wang, Richard C, and Cohen, William W. Improving distant supervision for information extraction using label propagation through lists. In EMNLP, 2015.

Bordes, Antoine, Usunier, Nicolas, Garcia-Duran, Alberto, Weston, Jason, and Yakhnenko, Oksana. Translating embeddings for modeling multi-relational data. In NIPS, pp. 2787-2795, 2013.

Bottou, Léon. Large-scale machine learning with stochastic gradient descent. In COMPSTAT, pp. 177-186. Springer, 2010.

Carlson, Andrew, Betteridge, Justin, Kisiel, Bryan, Settles, Burr, Hruschka Jr, Estevam R, and Mitchell, Tom M. Toward an architecture for never-ending language learning. In AAAI, volume 5, pp. 3, 2010.

Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. JMLR, 12:2493-2537, 2011.

Dalvi, Bhavana and Cohen, William W. Hierarchical semi-supervised classification with incomplete class hierarchies. In WSDM, 2016.

Handcock, Mark S, Raftery, Adrian E, and Tantrum, Jeremy M. Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2):301-354, 2007.

Ji, Ming, Sun, Yizhou, Danilevsky, Marina, Han, Jiawei, and Gao, Jing. Graph regularized transductive classification on heterogeneous information networks. In Machine Learning and Knowledge Discovery in Databases, pp. 570-586. Springer, 2010.

Joachims, Thorsten. Transductive inference for text classification using support vector machines. In ICML, volume 99, pp. 200-209, 1999.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In NIPS, pp. 3111-3119, 2013.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. EMNLP, 12:1532-1543, 2014.

Perozzi, Bryan, Al-Rfou, Rami, and Skiena, Steven. Deepwalk: Online learning of social representations. In KDD, pp. 701-710, 2014.

Sen, Prithviraj, Namata, Galileo, Bilgic, Mustafa, Getoor, Lise, Galligher, Brian, and Eliassi-Rad, Tina. Collective classification in network data. AI Magazine, 29(3):93, 2008.

Snijders, Tom AB and Nowicki, Krzysztof. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75-100, 1997.

Talukdar, Partha Pratim and Crammer, Koby. New regularized algorithms for transductive learning. In Machine Learning and Knowledge Discovery in Databases, pp. 442-457. Springer, 2009.

Tang, Jian, Qu, Meng, Wang, Mingzhe, Zhang, Ming, Yan, Jun, and Mei, Qiaozhu. LINE: Large-scale information network embedding. In WWW, pp. 1067-1077, 2015.

Tian, Fei, Gao, Bin, Cui, Qing, Chen, Enhong, and Liu, Tie-Yan. Learning deep representations for graph clustering. In AAAI, pp. 1293-1299, 2014.

Van der Maaten, Laurens and Hinton, Geoffrey. Visualizing data using t-SNE. JMLR, 9(2579-2605):85, 2008.

Weston, Jason, Ratle, Frédéric, Mobahi, Hossein, and Collobert, Ronan. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639-655. Springer, 2012.

Wijaya, Derry, Talukdar, Partha Pratim, and Mitchell, Tom. PIDGIN: ontology alignment using web text as interlingua. In CIKM, pp. 589-598, 2013.

Yang, Zhilin, Salakhutdinov, Ruslan, and Cohen, William. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270, 2016.

Zhou, Dengyong, Bousquet, Olivier, Lal, Thomas Navin, Weston, Jason, and Schölkopf, Bernhard. Learning with local and global consistency. NIPS, 16(16):321-328, 2004.

Zhu, Xiaojin, Ghahramani, Zoubin, Lafferty, John, et al. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, volume 3, pp. 912-919, 2003.
