Revisiting Semi-Supervised Learning With Graph Embeddings
embeddings as a parameterized function of input feature vectors; i.e., the embeddings can be viewed as hidden layers of a neural network.

To demonstrate the effectiveness of our proposed approach, we conducted experiments on five datasets for three tasks, including text classification, distantly supervised entity extraction, and entity classification. Our inductive method

Graph-based semi-supervised learning is based on the assumption that nearby nodes tend to have the same labels. Generally, the loss function of graph-based semi-supervised learning in the binary case can be written as

$$\sum_{i=1}^{L} l(y_i, f(x_i)) + \lambda \sum_{i,j} a_{ij} \| f(x_i) - f(x_j) \|^2 \qquad (1)$$
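For concreteness, here is a minimal NumPy sketch of this objective for the binary case; the function and argument names are our own, and a squared loss stands in for the generic l(y_i, f(x_i)).

```python
import numpy as np

def laplacian_loss(f, x, y_labeled, A, lam):
    """Sketch of the graph-regularized objective in Eq. (1).

    f          -- prediction function mapping a feature vector to a scalar
    x          -- (n, d) feature matrix; the first L rows are labeled
    y_labeled  -- (L,) labels for the first L instances
    A          -- (n, n) adjacency matrix with entries a_ij
    lam        -- the weighting factor lambda
    """
    preds = np.array([f(xi) for xi in x])
    L = len(y_labeled)
    # Supervised term: sum_i l(y_i, f(x_i)); squared loss is an assumption.
    supervised = np.sum((y_labeled - preds[:L]) ** 2)
    # Graph regularizer: sum_ij a_ij * ||f(x_i) - f(x_j)||^2,
    # which pulls the predictions of linked nodes together.
    diffs = preds[:, None] - preds[None, :]
    regularizer = np.sum(A * diffs ** 2)
    return supervised + lam * regularizer
```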
Table 1. Comparison of various semi-supervised learning algorithms and graph embedding algorithms. ✓ means using the given formulation or information; × means not available or not using the information. In the column graph, regularization means imposing regularization with the graph structure; features means using graph structure as features; context means predicting the graph context.
(Snijders & Nowicki, 1997). A clustering method (Handcock et al., 2007) was proposed to learn latent social states in a social network to predict social ties.

More recently, a number of embedding learning methods are based on the Skipgram model, which is a variant of the softmax model. Given an instance and its context, the objective of Skipgram is usually formulated as minimizing the log loss of predicting the context using the embedding of an instance as input features. Formally, let {(i, c)} be a set of pairs of instance i and context c; the loss function can be written as

$$-\sum_{(i,c)} \log p(c|i) = -\sum_{(i,c)} \Big( w_c^T e_i - \log \sum_{c' \in C} \exp(w_{c'}^T e_i) \Big) \qquad (2)$$

where C is the set of all possible contexts, the w's are parameters of the Skipgram model, and e_i is the embedding of instance i. Skipgram was first introduced to learn representations of words, known as word2vec (Mikolov et al., 2013). In word2vec, for each training pair (i, c), the instance i is the current word whose embedding is under estimation; the context c is each of the surrounding words of i within a fixed window size in a sentence; the context space C is the vocabulary of the corpus. Skipgram was later extended to learn graph embeddings. Deepwalk (Perozzi et al., 2014) uses the embedding of a node to predict the context in the graph, where the context is generated by random walk. More specifically, for each training pair (i, c), the instance i is the current node whose embedding is under estimation; the context c is each of the neighbor nodes within a fixed window size in a generated random walk sequence; the context space C is all the nodes in the graph. LINE (Tang et al., 2015) extends the model to have multiple context spaces C for modeling both first- and second-order proximity.

Although Skipgram-like models for graphs have received much recent attention, many other models exist. TransE (Bordes et al., 2013) learns the embeddings of entities in a knowledge graph jointly with their relations. Autoencoders were used to learn graph embeddings for clustering on graphs (Tian et al., 2014).

2.4. Comparison

We compare our approach in this paper with other methods in semi-supervised learning and embedding learning in Table 1. Unlike our approach, conventional graph Laplacian based methods (Zhu et al., 2003; Belkin et al., 2006; Talukdar & Crammer, 2009) impose regularization on the labels but do not learn embeddings. The semi-supervised embedding method (Weston et al., 2012) learns embeddings in a neural network, but our approach differs from this method in that instead of imposing regularization, we use the embeddings to predict the context in the graph. Graph embedding methods (Perozzi et al., 2014; Tian et al., 2014) encode the graph structure into embeddings; however, different from our approach, these methods are purely unsupervised and do not leverage label information for a specific task. Moreover, these methods are transductive and cannot be directly generalized to instances unseen at training time.

3. Semi-Supervised Learning with Graph Embeddings

Following the notation in the previous section, the input to our method includes labeled instances x_{1:L}, y_{1:L}, unlabeled instances x_{L+1:L+U}, and a graph denoted as a matrix A. Each instance i has an embedding denoted as e_i.

We formulate our framework based on feed-forward neural networks. Given the input feature vector x, the k-th hidden layer of the network is denoted as h^k, which is a nonlinear function of the previous hidden layer h^{k-1}, defined as h^k(x) = ReLU(W^k h^{k-1}(x) + b^k), where W^k and b^k are the parameters of the k-th layer and h^0(x) = x. We adopt the rectified linear unit ReLU(x) = max(0, x) as the nonlinear function in this work.

The loss function of our framework can be expressed as

$$L_s + \lambda L_u,$$
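As a sketch of the feed-forward building block defined above, the following NumPy fragment computes h^k(x) layer by layer; the helper names and parameter shapes are illustrative assumptions, not code from the paper.

```python
import numpy as np

def relu(z):
    # ReLU(z) = max(0, z), the nonlinearity adopted in this work.
    return np.maximum(0.0, z)

def hidden(x, weights, biases):
    """Compute h^k(x) with h^0(x) = x and
    h^k(x) = ReLU(W^k h^{k-1}(x) + b^k).

    weights, biases -- lists of the per-layer parameters W^k and b^k;
    each W has shape (d_out, d_in) and each b has shape (d_out,).
    """
    h = x
    for W, b in zip(weights, biases):
        h = relu(W @ h + b)
    return h

# The overall objective combines the supervised term L_s with the
# unsupervised context-prediction term L_u weighted by lambda:
#   total_loss = L_s + lam * L_u
```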
where [·, ·] denotes the concatenation of two row vectors, the superscript T denotes the transpose, so h^T is the transpose of the vector h, and w represents the model parameters.

Combined with Eq. (3), the loss function of transductive learning is defined as:

$$-\frac{1}{L} \sum_{i=1}^{L} \log p(y_i|x_i, e_i) - \lambda \, \mathbb{E}_{(i,c,\gamma)} \log \sigma(\gamma w_c^T e_i),$$

where the first term is defined by Eq. (4) and λ is a constant weighting factor. The first term is the loss function of class label prediction and the second term is the loss function of context prediction. This formulation is transductive because the prediction of label y depends on the embedding e, which can only be learned for instances observed in the graph A during training time.

3.3. Inductive Formulation

While we consider transductive learning in the above formulation, in many cases it is desirable to learn a classifier that can generalize to unobserved instances, especially for large-scale tasks. For example, machine reading systems (Carlson et al., 2010) very frequently encounter novel entities on the Web, and it is not practical to train a semi-supervised learning system on the entire Web. However, since learning graph embeddings is transductive in nature, it is not straightforward to apply it in an inductive setting. Perozzi et al. (2014) addressed this issue by retraining the embeddings incrementally, which is time consuming, does not scale, and is essentially not inductive.

To make the method inductive, the prediction of label y should only depend on the input feature vector x. Therefore, we define the embedding e as a parameterized function of the feature x, as shown in Figure 2(b). Similar to the transductive formulation, we apply k layers to the input feature vector x to obtain h^k(x). However, rather than using a "free" embedding, we apply l_1 layers to the input feature vector x and define the result as the embedding e = h^{l_1}(x). Then another l_2 layers are applied to the embedding, h^{l_2}(e) = h^{l_2}(h^{l_1}(x)), denoted as h^l(x) where l = l_1 + l_2. The embedding e in this formulation can be viewed as a hidden layer that is a parameterized function of the feature x.

With the above formulation, the label y only depends on the feature x. More specifically,

$$p(y|x) = \frac{\exp([h^k(x)^T, h^l(x)^T] \, w_y)}{\sum_{y'} \exp([h^k(x)^T, h^l(x)^T] \, w_{y'})} \qquad (5)$$

Replacing e_i in Eq. (3) with h^{l_1}(x_i), the loss function of inductive learning is

$$-\frac{1}{L} \sum_{i=1}^{L} \log p(y_i|x_i) - \lambda \, \mathbb{E}_{(i,c,\gamma)} \log \sigma(\gamma w_c^T h^{l_1}(x_i)),$$

where the first term is defined by Eq. (5).
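To make the inductive formulation concrete, here is a hedged NumPy sketch of Eq. (5); net_k, net_l1, net_l2, and Wy are illustrative stand-ins for the k-, l_1-, and l_2-layer networks and the output weights (e.g., built from the hidden() helper sketched earlier), not names from the paper.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict(x, net_k, net_l1, net_l2, Wy):
    """Sketch of Eq. (5): the class distribution conditioned only on x.

    net_k  -- applies k layers to x, giving h^k(x)
    net_l1 -- applies l_1 layers to x, giving the embedding e = h^{l_1}(x)
    net_l2 -- applies l_2 more layers, giving h^l(x) with l = l_1 + l_2
    Wy     -- (num_classes, dim) output weight matrix; all names are ours
    """
    hk = net_k(x)                   # feature path h^k(x)
    e = net_l1(x)                   # embedding as a function of x
    hl = net_l2(e)                  # h^l(x) = h^{l_2}(h^{l_1}(x))
    rep = np.concatenate([hk, hl])  # [h^k(x)^T, h^l(x)^T]
    return softmax(Wy @ rep)        # p(y|x)
```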
Figure 3. t-SNE Visualization of embedding spaces on the Cora dataset. Each color denotes a class.
Table 5. Accuracy on NELL entity classification with labeling rates of 0.1, 0.01, and 0.001. Upper rows are inductive methods and lower rows are transductive methods.

Method           0.1     0.01    0.001
Feat             0.621   0.404   0.217
ManiReg          0.634   0.413   0.218
SemiEmb          0.654   0.438   0.267
Planetoid-I      0.702   0.598   0.454
---------------------------------------
LP               0.714   0.448   0.265
GraphEmb         0.795   0.725   0.581
Planetoid-G/T    0.845   0.757   0.619

propagation, and SVMLight^4 for TSVM. We also use our own implementation of ManiReg and SemiEmb, obtained by modifying the symbolic objective function in Planetoid. In all of our experiments, we set the model hyper-parameters to r_1 = 5/6, q = 10, d = 3, N_1 = 200 and N_2 = 200 for Planetoid. We use the same r_1, q and d for GraphEmb, and the same N_1 and N_2 for ManiReg and SemiEmb. We tune r_2, T_1, T_2, the learning rate, and the hyper-parameters in other models based on an additional data split with a different random seed.

^4 http://svmlight.joachims.org/

The statistics for five of our benchmark datasets are reported in Table 2. For each dataset, we split all instances into three parts: labeled data, unlabeled data, and test data. Inductive methods are trained on the labeled and unlabeled data, and tested on the test data. Transductive methods, on the other hand, are trained on the labeled data, the unlabeled data, and the test data without labels.

4.1. Text Classification

We first considered three text classification datasets^5: Citeseer, Cora and Pubmed (Sen et al., 2008). Each dataset contains bag-of-words representations of documents and citation links between the documents. We treat the bag-of-words as feature vectors x. We construct the graph A based on the citation links: if document i cites j, then we set a_ij = a_ji = 1. The goal is to classify each document into one class. We randomly sample 20 instances for each class as labeled data and 1,000 instances as test data; the rest are used as unlabeled data. The same data splits are used for the different methods, and we compute the average accuracy for comparison.

^5 http://linqs.umiacs.umd.edu/projects//projects/lbc/

The experimental results are reported in Table 3. Among the inductive methods, Planetoid-I achieves the best performance on all three datasets, with an improvement of up to 6.1% on Pubmed, which indicates that our embedding techniques are more effective than graph Laplacian regularization. Among the transductive methods, Planetoid-T achieves the best performance on Cora and Pubmed, while TSVM performs the best on Citeseer. However, TSVM does not perform well on Cora and Pubmed. Planetoid-I slightly outperforms Planetoid-T on Citeseer and Pubmed, while Planetoid-T gets up to 14.5% improvement over Planetoid-I on Cora. We conjecture that in Planetoid-I, the feature vectors impose constraints on the learned embeddings, since the embeddings are represented by a parameterized function of the input feature vectors. If such constraints are appropriate, as is the case on Citeseer and Pubmed, they improve the non-convex optimization of embedding learning and lead to better performance. However, if such constraints rule out the optimal embeddings, the inductive model will suffer.

Planetoid-G consistently outperforms GraphEmb on all three datasets, which indicates that joint training with label information can improve the performance over training the supervised and unsupervised objectives separately. Figure 3 displays the 2-D embedding spaces on the Cora dataset using t-SNE (Van der Maaten & Hinton, 2008). Note that different classes are better separated in the embedding space of Planetoid-T than in those of GraphEmb and
SemiEmb, which is consistent with our empirical findings. We also observe similar results for the other two datasets.

4.2. Distantly-Supervised Entity Extraction

We next considered the DIEL (Distant Information Extraction using coordinate-term Lists) dataset (Bing et al., 2015). The DIEL dataset contains pre-extracted features for each entity mention in text, and a graph that connects entity mentions to coordinate lists. The goal is to extract medical entities from text given the feature vectors and the graph.

We follow the exact experimental setup of the original DIEL paper (Bing et al., 2015), including the data splits of different runs, the preprocessing of entity mentions and coordinate lists, and the evaluation. We treat the top-k entities returned by a model as positive instances and compute recall@k for evaluation (k is set to 240,000 following the DIEL paper). We report the average result of 10 runs in Table 4, where Feat refers to a result obtained by SVM (referred to as DS-Baseline in the DIEL paper). The result of LP was also taken from Bing et al. (2015). DIEL in Table 4 refers to the method proposed in the original paper, which is an improved version of label propagation that trains classifiers on feature vectors based on the output of label propagation. We did not include TSVM in the comparison since it does not scale. Since we use Freebase as ground truth and some entities are not present in text, the upper bound of recall shown in Table 4 is 0.617.

Both Planetoid-I and Planetoid-T significantly outperform all other methods. Each of Planetoid-I and Planetoid-T achieves the best performance in 5 out of 10 runs, and they give similar recall on average, which indicates that there is no significant difference between the two methods on this dataset. Planetoid-G clearly outperforms GraphEmb, which again shows the benefit of joint training.

4.3. Entity Classification

We constructed an entity classification dataset from the knowledge base of Never Ending Language Learning (NELL) (Carlson et al., 2010) and a hierarchical entity classification dataset (Dalvi & Cohen, 2016) that links NELL entities to text in ClueWeb09. We extracted the entities and the relations between entities from the NELL knowledge base, and then obtained text descriptions by linking the entities to ClueWeb09. We use the text bag-of-words representation as the feature vectors of the entities.

We next describe how to construct the graph based on the knowledge base. We first remove relations that are not populated in NELL, including "generalizations", "haswikipediaurl", and "atdate". In the knowledge base, each relation is denoted as a triplet (e1, r, e2), where e1, r, e2 denote the head entity, the relation, and the tail entity, respectively. We treat each entity e as a node in the graph, and each relation r is split into two nodes r1 and r2 in the graph. For each (e1, r, e2), we add two edges to the graph, (e1, r1) and (e2, r2).

We removed all classes with fewer than 10 entities. The goal is to classify the entities in the knowledge base into one of the 210 classes given the feature vectors and the graph. Let β be the labeling rate. We set β to 0.1, 0.01, and 0.001; max(βN, 1) instances are labeled for a class with N entities, so each class has at least one entity in the labeled data.

We report the results in Table 5. We did not include TSVM since it does not scale to such a large number of classes with the one-vs-rest scheme. Adding feature vectors does not improve the performance of Planetoid-T, so we set the feature vectors for Planetoid-T to be all empty; Planetoid-T is therefore equivalent to Planetoid-G in this case. Planetoid-I significantly outperforms the best of the other compared inductive methods, i.e., SemiEmb, by 4.8%, 16.0%, and 18.7% under the three labeling rates respectively. As the labeling rate decreases, the improvement of Planetoid-I over SemiEmb becomes more significant.

Graph structure is more informative than features in this dataset, so inductive methods perform worse than transductive methods. Planetoid-G outperforms GraphEmb by 5.0%, 3.2%, and 3.8%.

5. Conclusion

Our contribution is three-fold: a) in contrast to previous semi-supervised learning approaches that largely depend on graph Laplacian regularization, we propose a novel approach based on joint training of classification and graph context prediction; b) since it is difficult to generalize graph embeddings to novel instances, we design a novel inductive approach that conditions the embeddings on input features; c) we empirically show substantial improvement over existing methods (up to 8.5% and on average 4.1%), and even more significant improvement in the inductive setting (up to 18.7% and on average 7.8%).

Our experimental results on five benchmark datasets also show that a) joint training gives improvement over unsupervised learning; b) predicting graph context is more effective than graph Laplacian regularization; c) the performance of the inductive variant depends on the informativeness of feature vectors.

One direction of future work would be to apply our framework to more complex networks, including recurrent networks. It would also be interesting to experiment with datasets where a graph is computed based on distances between feature vectors.