Neural Graph Learning: Training Neural Networks Using Graphs

ABSTRACT
Label propagation is a powerful and flexible semi-supervised learning technique on graphs. Neural networks, on the other hand, have proven track records in many supervised learning tasks. In this work, we propose a training framework with a graph-regularised objective, namely Neural Graph Machines, that can combine the power of neural networks and label propagation. This work generalises previous literature on graph-augmented training of neural networks, enabling it to be applied to multiple neural architectures (feed-forward NNs, CNNs and LSTM RNNs) and a wide range of graphs. The new objective allows the neural networks to harness both labeled and unlabeled data by: (a) allowing the network to train using labeled data as in the supervised setting, and (b) biasing the network to learn similar hidden representations for neighboring nodes on a graph, in the same vein as label propagation. Such architectures with the proposed objective can be trained efficiently using stochastic gradient descent and scaled to large graphs, with a runtime that is linear in the number of edges. The proposed joint training approach convincingly outperforms many existing methods on a wide range of tasks (multi-label classification on social graphs, news categorization, document classification and semantic intent classification), with multiple forms of graph inputs (including graphs with and without node-level features) and using different types of neural networks.

CCS CONCEPTS
• Computing methodologies → Neural networks; Semi-supervised learning settings;

KEYWORDS
semi-supervised learning, neural network, graph

ACM Reference Format:
Thang D. Bui, Sujith Ravi, and Vivek Ramavajjala. 2018. Neural Graph Learning: Training Neural Networks Using Graphs. In WSDM 2018: The Eleventh ACM International Conference on Web Search and Data Mining, February 5–9, 2018, Marina Del Rey, CA, USA. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3159652.3159731

∗This work was done during an internship at Google.

1 INTRODUCTION
Semi-supervised learning is a powerful machine learning paradigm that can improve prediction performance compared to techniques that use only labeled data, by leveraging a large amount of unlabeled data. The need for semi-supervised learning arises in many problems in computer vision, natural language processing or social networks, in which obtaining labeled datapoints is expensive or unlabeled data is abundant and readily available.

There exists a plethora of semi-supervised learning methods. The simplest ones use bootstrapping to generate pseudo-labels for unlabeled data from a system trained on labeled data; however, this suffers from label error feedback [13]. In a similar vein, autoencoder-based methods often rely on a two-stage approach: train an autoencoder using unlabeled data to obtain an embedding mapping, and then use the learnt embeddings for prediction. In practice, this procedure is often costly and inaccurate. Another example is transductive SVMs [8], which are too computationally expensive to be used for large datasets. Methods based on generative models and amortized variational inference [10] can work well for images and videos, but it is not immediately clear how to extend such techniques to handle sparse and multi-modal inputs or graphs over the inputs.

In contrast to the methods above, graph-based techniques such as label propagation [4, 23] often provide a versatile, scalable, and yet effective solution to a wide range of problems. These methods construct a smooth graph over the unlabeled and labeled data. Graphs are also often a natural way to describe the relationships between nodes, such as similarities between embeddings, phrases or images, or connections between entities on the web or relations in a social network. Edges in the graph connect semantically similar nodes or datapoints, and, if present, edge weights reflect how strong such similarities are. Given a set of labeled nodes, such techniques iteratively refine the node labels by aggregating information from neighbours, and propagate these labels to the nodes' neighbours. In practice, these methods often converge quickly and can be scaled to large datasets with a large label space [15]. We build upon the principle behind label propagation for our method.

Another key motivation of our work is the recent advances in neural networks and their performance on a wide variety of supervised learning tasks such as image and speech recognition or sequence-to-sequence learning [7, 12, 17]. Such results are, however, conditioned on training very large networks on large datasets, which may need millions of labeled training input-output pairs. This begs the question: can we harness previous state-of-the-art semi-supervised learning techniques to jointly train neural networks using limited labeled data and unlabeled data, and thereby improve their performance?
Contributions: We propose a discriminative training objective for neural networks with graph augmentation that can be trained with stochastic gradient descent and efficiently scaled to large graphs. The new objective has a regularization term for generic neural network architectures that enforces similarity between nodes in the graphs, and is inspired by the objective function of label propagation. In particular, we show that:

• Graph-augmented neural network training can work for a wide range of neural networks, such as feed-forward, convolutional and recurrent networks. Additionally, this technique can be used in both inductive and transductive settings. It also helps learning in the low-sample regime (small number of labeled nodes), which cannot be handled by vanilla neural network training.
• The framework can handle multiple forms of graphs, either naturally given or constructed based on embeddings and knowledge bases.
• As a by-product, our proposed framework provides a simple technique for finding smaller and faster neural networks that offer competitive performance with larger and slower non-graph-augmented alternatives (see section 4.2).

We experimentally show that the proposed training framework outperforms the state of the art or performs favourably on a variety of prediction tasks and datasets, involving text features and/or graph inputs, and on many different neural network architectures (see section 4).

The paper is organized as follows: we first review some background and literature, and relate them to our approach in section 2; we then detail the training objective and its properties in section 3; and finally we validate our approach on a range of experiments in section 4.

2 BACKGROUND AND RELATED WORKS
In this section, we lay out the groundwork for our proposed training objective in section 3.

2.1 Neural network learning
Neural networks are a class of non-linear mappings from inputs to outputs, comprised of multiple layers that can potentially learn useful representations for predicting the outputs. We view various models such as feed-forward neural networks, recurrent neural networks and convolutional networks under the same umbrella. Given a set of N training input-output pairs {x_n, y_n}_{n=1}^N, such neural networks are often trained by performing maximum likelihood learning, that is, tuning their parameters so that the networks' outputs are close to the ground truth under some criterion,

\[ \mathcal{C}_{\mathrm{NN}}(\theta) = \sum_{n} c\big(g_\theta(x_n), y_n\big), \qquad (1) \]

where g_θ(·) denotes the overall mapping, parameterized by θ, and c(·) denotes a loss function such as l-2 for regression or cross-entropy for classification. The cost function c and the mapping g are typically differentiable w.r.t. θ, which facilitates optimisation via gradient descent. Importantly, this can be scaled to a large number of training instances by employing stochastic training using minibatches of data. However, it is not clear how unlabeled data, if available, can be treated using this objective, or whether extra information about the training set, such as relational structures, can be used.

2.2 Graph-based semi-supervised learning
In this section, we provide a concise introduction to graph-based semi-supervised learning using label propagation and its training objective. Suppose we are given a graph G = (V, E, W) where V is the set of nodes, E the set of edges and W the edge weight matrix. Let V_l and V_u be the labeled and unlabeled nodes in the graph. The goal is to predict a soft assignment of labels for each node in the graph, Ŷ, given the training label distribution for the seed nodes, Y. Mathematically, label propagation performs minimization of the following convex objective function, for L labels,

\[ \mathcal{C}_{\mathrm{LP}}(\hat{Y}) = \mu_1 \sum_{v \in V_l} \big\lVert \hat{Y}_v - Y_v \big\rVert_2^2 + \mu_2 \sum_{v \in V,\, u \in \mathcal{N}(v)} w_{u,v} \big\lVert \hat{Y}_v - \hat{Y}_u \big\rVert_2^2 + \mu_3 \sum_{v \in V} \big\lVert \hat{Y}_v - U \big\rVert_2^2, \qquad (2) \]

subject to Σ_{l=1}^{L} Ŷ_{vl} = 1, where N(v) is the neighbour node set of the node v, U is the prior distribution over all labels, w_{u,v} is the edge weight between nodes u and v, and µ1, µ2, and µ3 are hyperparameters that balance the contribution of the individual terms in the objective. The terms in the objective function above encourage that: (a) the label distributions of seed nodes should be close to the ground truth, (b) the label distributions of neighbouring nodes should be similar, and (c) if relevant, the label distributions should stay close to our prior belief. This objective function can be solved efficiently using iterative methods such as the Jacobi procedure. That is, in each step, each node aggregates the label distributions from its neighbours and adjusts its own distribution, which is then repeated until convergence. In practice, the iterative updates can be done in parallel or in a distributed fashion, which then allows large graphs with a large number of nodes and labels to be trained efficiently. [4] and [15] provide good surveys on the topic for interested readers.
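To make this concrete, the following NumPy sketch implements a Jacobi-style update for the objective in eq. (2): each node's distribution is set to a weighted average of its seed label (if any), its neighbours' current distributions and the prior. The update rule, initialisation and stopping rule are illustrative simplifications (rows are renormalised rather than projected onto the simplex); this is not the distributed implementation of [15].

```python
import numpy as np

def label_propagation(W, Y_seed, seed_mask, U, mu=(1.0, 1.0, 0.01), n_iters=100):
    """Jacobi-style label propagation sketch for eq. (2).

    W: (n, n) symmetric non-negative edge weight matrix.
    Y_seed: (n, L) label distributions; rows of non-seed nodes are ignored.
    seed_mask: (n,) boolean array marking labeled (seed) nodes.
    U: (L,) prior label distribution.
    """
    mu1, mu2, mu3 = mu
    n, L = Y_seed.shape
    Y_hat = np.tile(U, (n, 1))  # initialise every node with the prior
    for _ in range(n_iters):
        # Each node aggregates its neighbours' current label distributions,
        # its own seed label (if any) and the prior, weighted by mu1..mu3.
        neigh = mu2 * (W @ Y_hat)
        seed = mu1 * np.where(seed_mask[:, None], Y_seed, 0.0)
        prior = mu3 * U[None, :]
        denom = mu2 * W.sum(axis=1, keepdims=True) + mu1 * seed_mask[:, None] + mu3
        Y_new = (neigh + seed + prior) / denom
        # Renormalise so each row stays a distribution (the simplex constraint).
        Y_new /= Y_new.sum(axis=1, keepdims=True)
        if np.abs(Y_new - Y_hat).max() < 1e-6:
            Y_hat = Y_new
            break
        Y_hat = Y_new
    return Y_hat
```

Each sweep only touches a node and its neighbours, which is why the updates parallelise well and each sweep costs time linear in the number of edges.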
There are many variants of label propagation that can be viewed as optimising modified versions of eq. (2), in essence balancing the smoothness constraint and the fitting constraint [22]. For example, manifold regularization [3] replaces the label distribution Ŷ by a Reproducing Kernel Hilbert Space mapping from input features. Similarly, [18] also employs such a mapping but uses a feed-forward neural network instead. Both methods can be classified as inductive learning algorithms, whereas the original label propagation algorithm is transductive [19].

These aforementioned methods are closest to our proposed approach; however, there are key differences. Our work generalizes previously proposed frameworks for graph-augmented training of neural networks (e.g., [18]) and extends them to new settings, for example, when there is only graph input and no features are available. Unlike the previous works, we show that the graph-augmented training method can work with multiple neural network architectures (feed-forward NNs, CNNs, RNNs) and on multiple prediction tasks and datasets, using natural as well as constructed graphs. The experiment results (see section 4) clearly validate the effectiveness of this method in all these different settings, in both inductive and transductive learning paradigms. Besides the methodology, our study also presents an important contribution towards assessing the effectiveness of graph-combined neural networks as a generic training mechanism for different architectures and problems, which was not well studied in previous work.

More recently, graph embedding techniques have been used to create node embeddings that encode the local structure of the graph and the provided node labels [14, 19]. These techniques target learning better node representations to be used for other tasks such as node classification. In this work, we aim to directly learn better predictive models from the graph. We compare our method to these two-stage (embedding + classifier) techniques in several experiments in section 4.

Our work is also different from, and orthogonal to, recent works on using neural networks on graphs; for example, [5, 11] employ spectral graph convolution to create a neural-network-like classifier. However, these approaches require many approximations to arrive at a practical implementation. Here, we advocate a training objective that uses graphs to augment neural network learning, and that works with many forms of graphs and with any type of neural network.

3 NEURAL GRAPH MACHINES
In this section, we devise a discriminative training objective for neural networks that is inspired by the label propagation objective, uses both labeled and unlabeled data, and can be trained by stochastic gradient descent.

First, we take a close look at the two objective functions discussed in section 2. The label propagation objective (eq. (2)) ensures that the predicted label distributions of neighbouring nodes are similar, while those of labeled nodes are close to the ground truth. For example: if a cat image and a dog image are strongly connected in a graph, and if the cat node is labeled as animal, the predicted probability of the dog node being animal is also high. In contrast, the neural network training objective (eq. (1)) only takes into account the labeled instances, and ensures correct predictions on the training set. As a consequence, a neural network trained on the cat image alone will not make an accurate prediction on the dog image.

Such a shortcoming of neural network training can be rectified by biasing the network using prior knowledge about the relationships between instances in the dataset. In particular, for the domains we are interested in, training instances (either labeled or unlabeled) that are connected in a graph, for example the dog and cat in the above example, should have similar predictions. This can be done by encouraging neighboring data points to have similar hidden representations learnt by a neural network, resulting in a modified objective function for training neural network architectures using both labeled and unlabeled datapoints. We call architectures trained using this objective Neural Graph Machines (NGM), and schematically illustrate the concept in figure 1. The proposed objective function is a weighted sum of the neural network cost and the label propagation cost:

\[ \mathcal{C}_{\mathrm{NGM}}(\theta) = \alpha_1 \sum_{(u,v) \in E_{LL}} w_{uv}\, d\big(h_\theta(x_u), h_\theta(x_v)\big) + \alpha_2 \sum_{(u,v) \in E_{LU}} w_{uv}\, d\big(h_\theta(x_u), h_\theta(x_v)\big) + \alpha_3 \sum_{(u,v) \in E_{UU}} w_{uv}\, d\big(h_\theta(x_u), h_\theta(x_v)\big) + \sum_{n=1}^{|V_l|} c\big(g_\theta(x_n), y_n\big), \qquad (3) \]

where E_LL, E_LU, and E_UU are the sets of labeled-labeled, labeled-unlabeled and unlabeled-unlabeled edges respectively, h(·) represents the hidden representations of the inputs produced by the neural network, d(·) is a distance metric, and {α1, α2, α3} are hyperparameters. Note that we have separated the terms based on the edge types, as these can affect the training differently.

Our framework is general, so one can plug in either the hidden representations at any intermediate layer or the estimated soft label vector at the final layer. However, similar to any neural network regularisation scheme, it is not obvious what strategy works best in general. For example, forcing bottom layers (closer to the inputs) to be similar would have a stronger regularisation effect, and vice versa. In practice, we choose an l-1 or l-2 distance metric for d(·) and h(x) to be the last hidden layer of the neural network, or a cross-entropy cost on the predicted label vectors.
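To illustrate how eq. (3) is used in training, the sketch below computes the objective for one minibatch of labeled examples and edges. The particular model (a single dense hidden layer standing in for h_θ), the l-2 distance for d, the softmax cross-entropy for c, and the TensorFlow-style code are illustrative choices, not a description of our released implementation.

```python
import tensorflow as tf

# Illustrative two-layer network; hidden() plays the role of h_theta and
# logits() of g_theta in eq. (3). Layer sizes are placeholders.
hidden_layer = tf.keras.layers.Dense(50, activation="relu")
output_layer = tf.keras.layers.Dense(10)

def hidden(x):
    return hidden_layer(x)

def logits(x):
    return output_layer(hidden(x))

def l2_distance(a, b):
    return tf.reduce_sum(tf.square(a - b), axis=-1)

def ngm_minibatch_loss(labeled_x, labeled_y, edges, alphas=(0.2, 0.2, 0.2)):
    """edges is a dict with keys 'LL', 'LU', 'UU'; each value is a tuple
    (x_u, x_v, w) of endpoint features and edge weights for that edge type."""
    # Supervised term: c(g_theta(x_n), y_n) over labeled nodes in the batch.
    sup = tf.reduce_sum(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            labels=labeled_y, logits=logits(labeled_x)))
    # Graph terms: alpha_i * sum over edges of w_uv * d(h(x_u), h(x_v)).
    reg = 0.0
    for alpha, key in zip(alphas, ("LL", "LU", "UU")):
        x_u, x_v, w = edges[key]
        reg += alpha * tf.reduce_sum(w * l2_distance(hidden(x_u), hidden(x_v)))
    return sup + reg
```

Because each graph term touches only the two endpoints of an edge, sampling minibatches of edges keeps the per-epoch cost linear in the number of edges.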
3.1 Connections to previous methods
The graph-dependent α hyperparameters control the balance of the contributions of the different edge types. When {α_i = 0}_{i=1}^3, the proposed objective ignores the similarity constraint and reduces to the supervised-only neural network objective of eq. (1). When only α1 ≠ 0, the training cost has an additional term for labeled nodes that acts as a regularizer. When g_θ(x) = h_θ(x) = ŷ, where ŷ is the label distribution, the individual cost functions (c and d) are squared l-2 norms, and the objective is optimised over ŷ directly instead of θ, we arrive at the label propagation objective of eq. (2). Therefore, the proposed objective can be thought of as a non-linear version of the label propagation objective, and a graph-regularized version of the neural network training objective.

3.2 Network inputs and graph construction
Similar to graph-based label propagation, the choice of the input graphs is critical in order to correctly bias the neural network's predictions. Depending on the type of graph and its nodes, graphs can be readily available to use, such as social networks or protein linking networks, or they can be constructed (a) using generic graphs such as knowledge bases, which consist of relationship links between entities, (b) using embeddings learnt by an unsupervised learning technique, or (c) using sparse feature representations for each vertex. Additionally, the proposed training objective can easily be modified for directed graphs.

We have discussed using node-level features as inputs to the neural network. In the absence of such inputs, our training scheme can still be deployed using input features derived from the graph itself (see figure 1E and section 4.1).
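As one concrete recipe for option (b) above, the sketch below builds a weighted graph from node embeddings using cosine similarity, keeping at most a fixed number of neighbours per node. The similarity threshold and neighbour cap are placeholders; sections 4.2 and 4.3 use task-specific variants of this construction.

```python
import numpy as np

def build_knn_graph(embeddings, max_neighbors=5, min_similarity=0.0):
    """Construct weighted edges from node embeddings.

    embeddings: (n, d) array, one embedding per node.
    Returns a list of (u, v, cosine_similarity) tuples with u < v.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    sims = unit @ unit.T                 # pairwise cosine similarities
    np.fill_diagonal(sims, -np.inf)      # no self-loops
    edges = {}
    for u in range(sims.shape[0]):
        # Keep only the top-k most similar nodes above the threshold.
        top = np.argsort(-sims[u])[:max_neighbors]
        for v in top:
            if sims[u, v] >= min_similarity:
                key = (min(u, int(v)), max(u, int(v)))
                edges[key] = float(sims[u, v])
    return [(u, v, w) for (u, v), w in sorted(edges.items())]
```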
Figure 1: A: An example of a graph and feature inputs. In this case, there are two labeled nodes (x_i, x_j) and one unlabeled node (x_k), and two edges. The feature vectors, one for each node, are used as neural network inputs. B, C and D: Illustration of Neural Graph Machines for feed-forward, convolutional and recurrent networks respectively: the training flow ensures that the neural net makes accurate node-level predictions and biases the hidden representations/embeddings of neighbouring nodes to be similar. In this example, we force h_i and h_j to be similar as there is an edge connecting the x_i and x_j nodes. E: Illustration of how we can construct inputs to the neural network using the adjacency matrix. In this example, we have three nodes and two edges. The feature vector created for each node (shown on the right) has 1's at its own index and at the indices of the nodes it is adjacent to.
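The construction in panel E can be expressed in a few lines of NumPy: each node's input vector is a binary vector with 1's at its own index and at the indices of its neighbours. This sketch is our reading of the figure; section 4.1 uses the rows of the adjacency matrix in the same spirit.

```python
import numpy as np

def adjacency_features(edges, num_nodes):
    """Build node input vectors from graph structure alone (figure 1E).

    edges: iterable of (u, v) pairs with 0-based node indices.
    Returns an array of shape (num_nodes, num_nodes) whose row i has 1's at
    index i and at the indices of i's neighbours.
    """
    feats = np.eye(num_nodes, dtype=np.float32)   # 1 at the node's own index
    for u, v in edges:
        feats[u, v] = 1.0                         # 1 at adjacent node indices
        feats[v, u] = 1.0
    return feats

# Example with three nodes and two edges, as in figure 1E.
X = adjacency_features([(0, 1), (1, 2)], num_nodes=3)
# X[0] == [1, 1, 0], X[1] == [1, 1, 1], X[2] == [0, 1, 1]
```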
4 EXPERIMENTS
All experiments use a TensorFlow implementation [1]. Models were trained using multiple runs; each experiment was run for a fixed number of time steps and batch size (details are described in each section). The observed variance across runs w.r.t. accuracy was small, around ±0.1%. Note that we did not perform any cross-validation to select the regularisation loss or which hidden layers to compare. As such, we expect even better results than the ones presented below if a more careful selection were in place.

4.1 Multi-label Classification of Nodes on Graphs
We first demonstrate our approach on a multi-label classification problem on nodes in a relationship graph. In particular, we consider the BlogCatalog dataset [2], a network of social relationships between bloggers. This graph has 10,312 nodes, 333,983 edges and 39 labels per node, which represent the bloggers, their social connections and the bloggers' interests, respectively. Following previous approaches in the literature [2, 6], we train and make predictions using multiple one-vs-rest classifiers.

Since there are no provided features for each node, we use the rows of the adjacency matrix as input features, as discussed in section 3.2. Feed-forward neural networks (FFNNs) with one hidden layer of 50 units are employed to map the constructed inputs to the node labels. As we use the test set to construct the graph and augment the training objective, the training in this experiment is transductive. Critically, to combat the unbalanced training set, we employ weighted sampling during training, i.e. making sure each minibatch has both positive and negative examples. In this experiment, we fix the α_i to be equal, experiment with α = 0.1, and use the l-2 metric to compute the distance d between the hidden representations of the networks. In addition, we create a range of train/test splits by varying the number of training points presented to the networks.

We compare our method (NGM-FFNN) against a two-stage approach that first uses node2vec [6] to generate node embeddings and then uses a linear one-vs-rest classifier for classification. The methods are evaluated using two metrics, Macro F1 and Micro F1. The average results for different train/test splits using our method and the baseline are included in table 1. In addition, we compare NGM-FFNN with a non-augmented FFNN in which α = 0, i.e. no edge information is used during training. We observe that the graph-augmented training scheme performs better (6% relative improvement on Macro F1 when the training set size is 20% and 50% of the dataset) or comparably (when the training size is 80%) compared to the vanilla neural networks trained with no edge information. Both methods significantly outperform the approach that uses node embeddings and linear classifiers. We observe the same improvement over node2vec on the Micro F1 metric; NGM-FFNN is comparable to the vanilla FFNN (α = 0) but outperforms the other methods on the recall metric.

Table 1: Macro F1 results for the BlogCatalog dataset averaged over 10 random splits (higher is better). Graph-regularized neural networks outperform the node2vec embedding plus linear classifier in all training size settings.

|Train| / |Dataset|    NGM-FFNN    node2vec¹
20%                    0.191       0.168
50%                    0.242       0.174
80%                    0.262       0.177

¹These results are different compared to [6], since we treat the classifiers (one per label) independently. Both methods shown here use the exact same settings and training/test data splits.

These results demonstrate that using the graph itself as direct input to the neural network and letting the network figure out a non-linear mapping directly from the raw graph is more effective than the two-stage approach considered. More importantly, the results also show that using the graph information improves performance in the limited data regime (for example, when the training set is only 20% or 50% of the dataset).

4.2 Text Classification using Character-level CNNs
We evaluate the proposed objective function on a multi-class text classification task using a character-level convolutional neural network (CNN). We use the AG news dataset from [21], where the task is to classify a news article into one of 4 categories. Each category has 30,000 examples for training and 1,900 examples for testing. In addition to the train and test sets, there are 111,469 examples that are treated as unlabeled.

As there is no provided graph structure linking the articles, we create such a graph based on the embeddings of the articles. We restrict the graph construction to the train set and the unlabeled examples, and keep the test set only for evaluation. We use the Google News word2vec corpus to calculate the average embedding for each news article, and use the cosine similarity of document embeddings as the similarity metric. Each node is restricted to have a maximum of 5 neighbors.

We construct the CNN in the same way as [21] and pick their competitive "small CNN" as our baseline for a more reasonable comparison to our set-up. Our approach employs the same network, but with a significantly smaller number of convolutional layers and smaller layer sizes, as shown in table 2.

Table 2: Settings of CNNs for the text classification experiment, including the number of convolutional layers and their sizes. The baseline model is the small CNN from [21] and is significantly larger than our model.

Setting                         Baseline    Our "tiny CNN"
# of conv. layers               6           3
Frame size in conv. layers      256         32
# of FC layers                  3           3
Hidden units in FC layers       1024        256
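For concreteness, a Keras-style sketch of a network in the spirit of our "tiny CNN" is shown below. Only the sizes taken from table 2 (3 convolutional layers with frame size 32, 3 fully-connected layers with 256 hidden units) and the 4 output classes reflect our setting; the kernel sizes, pooling, input length and alphabet size are assumptions in the style of [21], and the actual model follows the construction of [21].

```python
import tensorflow as tf

NUM_CLASSES = 4                 # AG news categories
SEQ_LEN, ALPHABET = 1014, 70    # character window and alphabet size (assumed)

def tiny_char_cnn():
    """Sketch of a character-level CNN with the sizes listed in table 2."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(SEQ_LEN, ALPHABET)),
        # 3 convolutional layers with frame size 32 (kernel sizes are assumptions).
        tf.keras.layers.Conv1D(32, 7, activation="relu"),
        tf.keras.layers.MaxPooling1D(3),
        tf.keras.layers.Conv1D(32, 7, activation="relu"),
        tf.keras.layers.MaxPooling1D(3),
        tf.keras.layers.Conv1D(32, 3, activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        # 3 fully-connected layers with 256 hidden units; the last layer
        # produces class logits used for c(.) and, optionally, for d(.).
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES),
    ])
```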
The networks are trained with the same hyper-parameters as reported in [21]. We observed that the model converged within 20 epochs (the model loss did not change much), and hence used this as a stopping criterion for this task. Experiments also showed that running the network for longer did not change the qualitative performance. We use the cross-entropy loss on the final outputs of the network, that is d = cross_entropy(g(x_u), g(x_v)), to compute the distance between nodes on an edge. In addition, we also experiment with a data augmentation technique using an English thesaurus, as done in [21].

We compare the "tiny CNN" trained using the proposed objective function with the baseline, using accuracy on the test set, in table 3. Our approach outperforms the baseline, providing a 1.8% absolute and 2.1% relative improvement in accuracy, despite using a much smaller network. In addition, our model with graph augmentation trains much faster and produces results on par with or better than the performance of a significantly larger network, the "large CNN" of [21], which has an accuracy of 87.18 without using a thesaurus and 86.61 with the thesaurus.

Table 3: Results for news article categorization using character-level CNNs. Our method gives better predictive accuracy, despite using a much smaller CNN compared to the "small CNN" baseline from [21]‡.

Network                                    Accuracy %
Baseline‡                                  84.35
Baseline with thesaurus augmentation‡      85.20
Our "tiny" CNN                             85.07
Our "tiny" CNN with NGM                    86.90

4.3 Semantic Intent Classification using LSTM RNNs
We compare the performance of our approach for training RNN sequence models (LSTMs) on a semantic intent classification task as described in the recent work on SmartReply [9] for automatically generating short email responses. One of the underlying tasks in SmartReply is to discover and map short response messages to semantic intent clusters.² We choose 20 intent classes and created a dataset comprised of 5,483 samples (3,832 for training, 560 for validation and 1,091 for testing). Each sample instance corresponds to a short response message text paired with a semantic intent category that was manually verified by human annotators. For example, "That sounds awesome!" and "Sounds fabulous" belong to the sounds good intent cluster.

²For details regarding SmartReply and how the semantic intent clusters are generated, refer to [9].

We construct a sparse graph in a similar manner to the news categorization task, using word2vec embeddings over the message text and computing similarity to generate a response message graph with fixed node degree (k=10). We use l-2 for the distance metric d(·) and choose α based on the development set.

We run the experiments for a fixed number of time steps and pick the best results on the development set. A multilayer LSTM architecture (2 layers, 100 dimensions) is used for the RNN sequence model. The LSTM model and its NGM variant are also compared against other baseline systems: the Random baseline ranks the intent categories randomly, and the Frequency baseline ranks them in order of their frequency in the training corpus. To evaluate the intent prediction quality of the different approaches, for each test instance we compute the rank of the actual intent category, rank_i, with respect to the ranking produced by the method, and use this to calculate the Mean Reciprocal Rank, MRR = (1/N) Σ_{i=1}^{N} 1/rank_i. We show in table 4 that LSTM RNNs with our proposed graph-augmented training objective function outperform the standard baselines by achieving a better MRR.

Table 4: Results for semantic intent classification using graph-augmented LSTM RNNs and baselines. Higher MRR is better.

Model         Mean Reciprocal Rank (MRR)
Random        0.175
Frequency     0.258
LSTM          0.276
NGM-LSTM      0.284

4.4 Low-supervision Document Classification
Finally, we compare our method on a task with very limited supervision: the PubMed document classification problem [16]. The task is to classify each document into one of 3 classes, with each document being described by a TF-IDF weighted word vector. The graph is available as a citation network: two documents are connected to each other if one cites the other. The graph has 19,717 nodes and 44,338 edges, with each class having 20 seed nodes and 1,000 test nodes. In our experiments we exclude the test nodes from the graph entirely, training only on the labeled and unlabeled nodes.

We train a feed-forward neural network (FFNN) with two hidden layers of 250 and 100 neurons, using the l-2 distance metric on the last hidden layer. The NGM-FFNN model is trained with α_i = 0.2, while the baseline FFNN is trained with α_i = 0 (i.e., a supervised-only model). We use self-training to train the model, starting with just the 60 seed nodes (20 per class) as training data. The amount of training data is iteratively increased by assigning labels to the immediate neighbors of the labeled nodes and retraining the model. For the self-trained NGM-FFNN model, this strategy results in incrementally growing the neighborhood and thereby the LL and LU edge sets in the objective of eq. (3).

We compare the final NGM-FFNN model against the FFNN baseline and other techniques reported in [19], including the Planetoid models [19], semi-supervised embedding [18], manifold regularization [3], transductive SVM [8], label propagation [24], graph embeddings [14] and a linear softmax model. Full results are included in table 5. The results show that the NGM model (without any tuning) outperforms many baselines, including FFNN, semi-supervised embedding, manifold regularization and Planetoid-G/Planetoid-T, and compares favorably to Planetoid-I. Most importantly, this result demonstrates that the graph augmentation scheme can lead to better regularised neural networks, especially in the low-sample regime (20 samples per class in this case). We believe that with tuning, NGM accuracy can be improved even further.
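The self-training procedure described in this section can be summarised as the following sketch; train_ngm and predict are hypothetical stand-ins for NGM training and inference, not functions from our implementation.

```python
def self_train(graph, seed_labels, num_rounds, train_ngm, predict):
    """Sketch of the self-training loop used for PubMed.

    graph: adjacency structure mapping each node to its neighbours.
    seed_labels: dict {node: label} containing only the seed nodes.
    train_ngm / predict: stand-in callables for NGM training and inference.
    """
    labeled = dict(seed_labels)
    model = train_ngm(graph, labeled)          # initial model on seeds only
    for _ in range(num_rounds):
        # Assign labels to the immediate neighbours of currently labeled nodes.
        frontier = {v for u in labeled for v in graph[u] if v not in labeled}
        if not frontier:
            break
        for v in frontier:
            labeled[v] = predict(model, v)
        # Retrain; the growing labeled set also grows the LL and LU edge sets
        # that enter the graph-regularisation terms of eq. (3).
        model = train_ngm(graph, labeled)
    return model, labeled
```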
Table 5: Results for document classification on the PubMed dataset using neural networks. The top results are taken from [19]. The bottom two rows are ours, with NGM training outperforming all other baselines except Planetoid-I. Please see the text for the relevant references.

Method                        Accuracy
Linear + Softmax              0.698
Semi-supervised embedding     0.711
Manifold regularization       0.707
Transductive SVM              0.622
Label propagation             0.630
Graph embedding               0.653
Planetoid-I                   0.772
Planetoid-G                   0.664
Planetoid-T                   0.757
Feed-forward NN               0.709
NGM-FFNN                      0.759

5 CONCLUSIONS
We have revisited graph-augmented training of neural networks and proposed Neural Graph Machines as a general framework for doing so. Its objective function encourages the neural networks to make accurate node-level predictions, as in vanilla neural network training, while constraining the networks to learn similar hidden representations for nodes connected by an edge in the graph. Importantly, the objective can be trained by stochastic gradient descent and scaled to large graphs.

We validated the efficacy of the graph-augmented objective on various tasks, including bloggers' interest, text category and semantic intent classification problems, using a wide range of neural network architectures (FFNNs, CNNs and LSTM RNNs). The experimental results demonstrated that graph-augmented training almost always helps to find better neural networks that outperform other techniques in predictive performance, or even much smaller networks that are faster and easier to train. Additionally, node-level input features can be combined with graph features as inputs to the neural networks. We showed that a neural network that simply takes the adjacency matrix of a graph and produces node labels can perform better than a recently proposed two-stage approach using sophisticated graph embeddings and a linear classifier. Our framework also excels when the neural network is small, or when there is limited supervision available. We note that though the overall complexity is linear in the number of edges in the graph, in practice NGM is more robust compared to the standard training method (without regularisation) and can converge to a better solution given a fixed time budget. We attribute this effect to the graph structure used for optimization within each mini-batch, rather than the individual training examples used in baseline networks.

While our objective can be applied to multiple graphs which come from different domains, we have not fully explored this aspect and leave it as future work. We expect the domain-specific networks to interact with the graphs to determine the importance of each domain/graph source in prediction. We also did not explore using graph regularisation for different hidden layers of the neural networks; we expect this is key for the multi-graph transfer setting [20]. Another possible future extension is to use our objective on directed graphs, that is, to control the direction of influence between nodes during training.

ACKNOWLEDGMENTS
The authors would like to thank the Google Expander team for insightful feedback.

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). http://tensorflow.org/ Software available from tensorflow.org.
[2] Nitin Agarwal, Huan Liu, Sudheendra Murthy, Arunabha Sen, and Xufei Wang. 2009. A Social Identity Approach to Identify Familiar Strangers in a Social Network. In 3rd International AAAI Conference on Weblogs and Social Media (ICWSM09).
[3] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7, Nov (2006), 2399–2434.
[4] Yoshua Bengio, Olivier Delalleau, and Nicolas Le Roux. 2006. Label propagation and quadratic criterion. In Semi-supervised learning, O. Chapelle, B. Scholkopf, and A. Zien (Eds.). MIT Press, 193–216.
[5] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. 2016. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems. 3837–3845.
[6] Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable Feature Learning for Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016. 855–864.
[7] Geoffrey Hinton, Li Deng, Dong Yu, George E. Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N. Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29, 6 (2012), 82–97.
[8] Thorsten Joachims. 1999. Transductive inference for text classification using support vector machines. In International Conference on Machine Learning.
[9] Anjuli Kannan, Karol Kurach, Sujith Ravi, Tobias Kaufmann, Andrew Tomkins, Balint Miklos, Greg Corrado, Laszlo Lukacs, Marina Ganea, Peter Young, and Vivek Ramavajjala. 2016. Smart Reply: Automated Response Suggestion for Email. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD).
[10] Diederik P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems. 3581–3589.
[11] Thomas N. Kipf and Max Welling. 2016. Semi-Supervised Classification with Graph Convolutional Networks. arXiv preprint arXiv:1609.02907 (2016).
[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[13] Dong-Hyun Lee. 2013. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. In ICML 2013 Workshop: Challenges in Representation Learning (WREPL).
[14] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 701–710.
[15] Sujith Ravi and Qiming Diao. 2016. Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics. 519–528.
[16] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. 2008. Collective classification in network data. AI Magazine 29, 3 (2008), 93.
[17] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[18] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. 2012. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade. Springer, 639–655.
[19] Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2016. Revisiting Semi-Supervised Learning with Graph Embeddings. In Proceedings of The 33rd International Conference on Machine Learning. 40–48.
[20] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. 2014. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems. 3320–3328.
[21] Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems. 649–657.
[22] Denny Zhou, Olivier Bousquet, Thomas N. Lal, Jason Weston, and Bernhard Schölkopf. 2004. Learning with local and global consistency. In Advances in Neural Information Processing Systems. 321–328.
[23] Xiaojin Zhu and Zoubin Ghahramani. [n. d.]. Learning from labeled and unlabeled data with label propagation. Technical Report. School of Computer Science, Carnegie Mellon University.
[24] Xiaojin Zhu, Zoubin Ghahramani, and John Lafferty. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-2003), Vol. 2. AAAI Press, 912–919.