Graph Convolutional Matrix Completion
Figure 1: Left: Rating matrix M with entries that correspond to user-item interactions (ratings between 1-5)
or missing observations (0). Right: User-item interaction graph with bipartite structure. Edges correspond
to interaction events, numbers on edges denote the rating a user has given to a particular item. The matrix
completion task (i.e. predictions for unobserved interactions) can be cast as a link prediction problem and modeled
using an end-to-end trainable graph auto-encoder.
In an equivalent picture, matrix completion or recommendation can be cast as a link prediction problem on a bipartite user-item interaction graph. More precisely, the interaction data can be represented by an undirected graph G = (W, E, R) with entities consisting of a collection of user nodes u_i ∈ U with i ∈ {1, ..., N_u}, and item nodes v_j ∈ V with j ∈ {1, ..., N_v}, such that U ∪ V = W. The edges (u_i, r, v_j) ∈ E carry labels that represent ordinal rating levels, such as r ∈ {1, ..., R} = R. This connection was previously explored in [18] and led to the development of graph-based methods for recommendation.

Previous graph-based approaches for recommender systems (see [18] for an overview) typically employ a multi-stage pipeline, consisting of a graph feature extraction model and a link prediction model, all of which are trained separately. Recent results, however, have shown that performance can often be significantly improved by modeling graph-structured data with end-to-end learning techniques [2, 6, 19, 23, 5, 15, 21], and specifically with graph auto-encoders [30, 14] for unsupervised learning and link prediction. In what follows, we introduce a specific variant of graph auto-encoders for the task of recommendation.

2.1 Graph auto-encoders

We revisit graph auto-encoders, which were originally introduced in [30, 14] as an end-to-end model for unsupervised learning [30] and link prediction [14] on undirected graphs. We specifically consider the setup introduced in [14], as it makes efficient use of (convolutional) weight sharing and allows for inclusion of side information in the form of node features. Graph auto-encoders are comprised of 1) a graph encoder model Z = f(X, A), which takes as input an N × D feature matrix X and a graph adjacency matrix A, and produces an N × E node embedding matrix Z = [z_1^T, ..., z_N^T]^T, and 2) a pairwise decoder model Ǎ = g(Z), which takes pairs of node embeddings (z_i, z_j) and predicts respective entries Ǎ_ij of the adjacency matrix. Note that N denotes the number of nodes, D the number of input features, and E the embedding size.

For bipartite recommender graphs G = (W, E, R), we can reformulate the encoder as [U, V] = f(X, M_1, ..., M_R), where M_r ∈ {0, 1}^{N_u × N_v} is the adjacency matrix associated with rating type r ∈ R, such that M_r contains 1s for those elements for which the original rating matrix M contains observed ratings with value r. U and V are now matrices of user and item embeddings, with shape N_u × E and N_v × E, respectively. A single user (item) embedding takes the form of a real-valued vector U_{i,:} (V_{j,:}) for user i (item j).

Analogously, we can reformulate the decoder as M̌ = g(U, V), i.e. as a function acting on the user and item embeddings and returning a (reconstructed) rating matrix M̌ of shape N_u × N_v. We can train this graph auto-encoder by minimizing the reconstruction error between the predicted ratings in M̌ and the observed ground-truth ratings in M. Examples of metrics for the reconstruction error are the root mean square error, or the cross entropy when treating the rating levels as different classes.
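To make this data representation concrete, the following sketch builds the per-rating adjacency matrices M_r from a small observed rating matrix M. It is an illustration only, assuming a NumPy/SciPy environment and a toy rating matrix; it is not the implementation used for the experiments in this paper.

```python
import numpy as np
import scipy.sparse as sp

# Toy rating matrix M with Nu = 3 users and Nv = 5 items (0 = unobserved), cf. Figure 1.
M = np.array([
    [0, 0, 5, 0, 5],
    [0, 0, 0, 4, 0],
    [0, 0, 2, 0, 0],
], dtype=np.int64)

R = 5  # rating levels r in {1, ..., R}

# One binary adjacency matrix per rating level: M_r[i, j] = 1 iff user i rated item j with r.
M_r = {r: sp.csr_matrix((M == r).astype(np.float32)) for r in range(1, R + 1)}

# Sanity check: the per-rating matrices together recover the observed entries of M.
reconstructed = sum(r * M_r[r] for r in range(1, R + 1)).toarray()
assert np.array_equal(reconstructed, M)
```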
We shall note at this point that several recent state-of-the-art models for matrix completion [17, 27, 7, 32] can be cast into this framework and understood as a special case of our model. An overview of these models is provided in Section 3.

2.2 Graph convolutional encoder
Figure 2: Schematic of a forward-pass through the GC-MC model, which is comprised of a graph convolutional encoder [U, V] = f(X, M_1, ..., M_R) that passes and transforms messages from user to item nodes, and vice versa, followed by a bilinear decoder model that predicts entries of the (reconstructed) rating matrix M̌ = g(U, V), based on pairs of user and item embeddings.
In what follows, we propose a particular choice of encoder model that makes efficient use of weight sharing across locations in the graph and that assigns separate processing channels to each edge type (or rating type) r ∈ R. The form of weight sharing is inspired by a recent class of convolutional neural networks that operate directly on graph-structured data [2, 6, 5, 15], in the sense that the graph convolutional layer performs local operations that only take the first-order neighborhood of a node into account, whereby the same transformation is applied across all locations in the graph.

This type of local graph convolution can be seen as a form of message passing [4, 8], where vector-valued messages are passed and transformed across edges of the graph. In our case, we can assign a specific transformation to each rating level, resulting in edge-type specific messages μ_{j→i,r} from items j to users i of the following form:

    \mu_{j \to i, r} = \frac{1}{c_{ij}} W_r x_j .    (1)

Here, c_{ij} is a normalization constant, which we choose to be either |N_i| (left normalization) or \sqrt{|N_i| |N_j|} (symmetric normalization), with N_i denoting the set of neighbors of node i. W_r is an edge-type specific parameter matrix and x_j is the (initial) feature vector of node j. Messages μ_{i→j,r} from users to items are processed in an analogous way. After the message passing step, we accumulate incoming messages at every node by summing over all neighbors N_{i,r} under a specific edge type r, and by subsequently accumulating them into a single vector representation:

    h_i = \sigma\left( \mathrm{accum}\left( \sum_{j \in N_{i,1}} \mu_{j \to i, 1}, \ldots, \sum_{j \in N_{i,R}} \mu_{j \to i, R} \right) \right) ,    (2)

where accum(·) denotes an accumulation operation, such as stack(·), i.e. a concatenation of vectors (or matrices along their first dimension), or sum(·), i.e. summation of all messages. σ(·) denotes an element-wise activation function such as ReLU(x) = max(0, x). To arrive at the final embedding of user node i, we transform the intermediate output h_i as follows:

    u_i = \sigma(W h_i) .    (3)

The item embedding v_i is calculated analogously with the same parameter matrix W. In the presence of user- and item-specific side information we use separate parameter matrices for user and item embeddings. We will refer to (2) as a graph convolution layer and to (3) as a dense layer. Note that deeper models can be built by stacking several layers (in arbitrary combinations) with appropriate activation functions. In initial experiments, we found that stacking multiple convolutional layers did not improve performance, and a simple combination of a convolutional layer followed by a dense layer worked best.
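As an illustration of Eqs. (1)-(3), a minimal NumPy sketch of the user-side graph convolution with sum accumulation and left normalization could look as follows. All sizes, array names and random weights are made up for the example; this is not the implementation used for the experiments in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: Nu users, Nv items, D input features, H hidden units, E embedding size, R levels.
Nu, Nv, D, H, E, R = 3, 5, 8, 6, 4, 5

# One binary user-item adjacency matrix per rating level (hypothetical toy data).
M_r = [(rng.random((Nu, Nv)) < 0.3).astype(float) for _ in range(R)]

X_items = rng.normal(size=(Nv, D))                             # item feature vectors x_j
W_r = [rng.normal(scale=0.1, size=(D, H)) for _ in range(R)]   # edge-type specific W_r
W = rng.normal(scale=0.1, size=(H, E))                         # dense-layer weight W

relu = lambda a: np.maximum(0.0, a)

# Left normalization c_ij = |N_i|: number of items connected to user i (any rating level).
deg_u = np.maximum(sum(M_r).sum(axis=1, keepdims=True), 1.0)

# Eqs. (1)-(2): per-rating messages, summed over neighbors, sum-accumulated over rating levels.
h_users = relu(sum((M_r[r] / deg_u) @ (X_items @ W_r[r]) for r in range(R)))

# Eq. (3): dense layer producing the final user embeddings u_i.
U = relu(h_users @ W)
print(U.shape)  # (Nu, E)
```

With stack accumulation, the per-rating outputs would instead be concatenated along the feature dimension before applying the nonlinearity, and the item embeddings V follow analogously using the transposed adjacency matrices.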
It is worth mentioning that the model demonstrated here is only one particular, relatively simple choice of encoder, and other variations are potentially worth exploring. Instead of a simple linear message transformation, one could explore variations where μ_{j→i,r} = nn(x_i, x_j, r) is a neural network in itself. Instead of choosing a specific normalization constant for individual messages, as done here, one could employ some form of attention mechanism, in which the individual contribution of each message is learned and determined by the model.

2.3 Bilinear decoder

For reconstructing links in the bipartite interaction graph we consider a bilinear decoder, and treat each rating level as a separate class. Indicating the reconstructed rating between user i and item j with M̌_ij, the decoder produces a probability distribution over possible rating levels through a bilinear operation followed by the application of a softmax function:

    p(\check{M}_{ij} = r) = \frac{e^{u_i^T Q_r v_j}}{\sum_{s \in R} e^{u_i^T Q_s v_j}} ,    (4)

with Q_r a trainable parameter matrix of shape E × E, and E the dimensionality of the hidden user (item) representations u_i (v_j). The predicted rating is computed as

    \check{M}_{ij} = g(u_i, v_j) = \mathbb{E}_{p(\check{M}_{ij} = r)}[r] = \sum_{r \in R} r \, p(\check{M}_{ij} = r) .    (5)
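A minimal sketch of the bilinear decoder of Eqs. (4) and (5), with toy embeddings and a hypothetical parameter tensor Q holding one matrix Q_r per rating level, could look as follows (illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
E, R = 4, 5                                # embedding size, number of rating levels

u_i = rng.normal(size=E)                   # toy user embedding u_i
v_j = rng.normal(size=E)                   # toy item embedding v_j
Q = rng.normal(scale=0.1, size=(R, E, E))  # one trainable matrix Q_r per rating level

# Eq. (4): bilinear scores u_i^T Q_r v_j turned into a distribution via a softmax.
scores = np.array([u_i @ Q[r] @ v_j for r in range(R)])
p = np.exp(scores - scores.max())
p /= p.sum()

# Eq. (5): predicted rating as the expectation over the rating levels 1..R.
ratings = np.arange(1, R + 1)
rating_pred = float(ratings @ p)
print(p, rating_pred)
```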
2.4 Model training

Loss function  During model training, we minimize the following negative log likelihood of the predicted ratings M̌_ij:

    L = - \sum_{i,j;\, \Omega_{ij} = 1} \; \sum_{r=1}^{R} I[r = M_{ij}] \log p(\check{M}_{ij} = r) ,    (6)

with I[k = l] = 1 when k = l and zero otherwise. The matrix Ω ∈ {0, 1}^{N_u × N_v} serves as a mask for unobserved ratings, such that ones occur for elements corresponding to observed ratings in M, and zeros for unobserved ratings. Hence, we only optimize over observed ratings.
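For completeness, a toy sketch of the masked negative log likelihood of Eq. (6); the arrays are hypothetical, and only observed entries (selected via the mask Ω) contribute to the loss:

```python
import numpy as np

rng = np.random.default_rng(2)
R = 5

# Toy observed rating matrix M (0 = unobserved) and toy predicted distributions p.
M = np.array([[5, 0, 3],
              [0, 1, 0]])
logits = rng.normal(size=M.shape + (R,))
p = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # p(M_ij = r)

Omega = M > 0                               # mask: 1 for observed entries, 0 otherwise
rows, cols = np.nonzero(Omega)

# Eq. (6): sum the negative log likelihood of the true rating class over observed entries only.
loss = -np.log(p[rows, cols, M[rows, cols] - 1]).sum()
print(loss)
```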
Node dropout  In order for the model to generalize well to unobserved ratings, it is trained in a denoising setup by randomly dropping out all outgoing messages of a particular node with a probability p_dropout, which we will refer to as node dropout. Messages are rescaled after dropout as in [28]. In initial experiments we found that node dropout was a more effective regularizer than message dropout. In the latter case, individual outgoing messages are dropped out independently, making embeddings more robust against the presence or absence of single edges. In contrast, node dropout also causes embeddings to be more independent of particular user or item influences. We furthermore also apply regular dropout [28] to the hidden layer units (3).

Mini-batching  We introduce mini-batching by sampling contributions to the loss function in Eq. (6) from different observed ratings. That is, we sample only a fixed number of contributions from the sum over user and item pairs. By only considering a fixed number of contributions to the loss function, we can remove the respective rows of users and items in M_1, ..., M_R in Eq. (7) that do not appear in the current batch. This serves both as an effective means of regularization and reduces the memory requirement to train the model, which is necessary to fit the full model for MovieLens-10M into GPU memory. We experimentally verified that training with mini-batches and full batches leads to comparable results for the MovieLens-1M dataset while adjusting for regularization parameters. For all datasets except MovieLens-10M, we opt for full-batch training, since it leads to faster convergence than training with mini-batches in this particular setting.

2.5 Vectorized implementation

In practice, we can use efficient sparse matrix multiplications, with complexity linear in the number of edges, i.e. O(|E|), to implement the graph auto-encoder model. The graph convolutional encoder (Eq. 3), for example in the case of left normalization, can be vectorized as follows:

    \begin{bmatrix} U \\ V \end{bmatrix} = f(X, M_1, \ldots, M_R) = \sigma\left( \begin{bmatrix} H_u \\ H_v \end{bmatrix} W^T \right) ,    (7)

    \text{with} \quad \begin{bmatrix} H_u \\ H_v \end{bmatrix} = \sigma\left( \sum_{r=1}^{R} D^{-1} \mathcal{M}_r X W_r \right) ,    (8)

    \text{and} \quad \mathcal{M}_r = \begin{bmatrix} 0 & M_r \\ M_r^T & 0 \end{bmatrix} .    (9)

The summation in (8) can be replaced with concatenation, similar to (2). In this case D denotes the diagonal node degree matrix with nonzero elements D_ii = |N_i|. Vectorization for an encoder with symmetric normalization, as well as vectorization of the bilinear decoder, follows in an analogous manner. Note that it is only necessary to evaluate observed elements in M̌, given by the mask Ω in Eq. (6).
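The following sketch illustrates this vectorized form of Eqs. (7)-(9) with SciPy sparse matrices. It is a simplified illustration under toy assumptions (left normalization, sum accumulation, one-hot node features, made-up sizes), not the implementation used for the experiments in this paper.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(3)
Nu, Nv, R, H, E = 4, 6, 5, 8, 3
N = Nu + Nv

# Per-rating user-item adjacency matrices M_r (toy, sparse, binary).
M_r = [sp.random(Nu, Nv, density=0.2, random_state=r, format="csr") for r in range(R)]
M_r = [(m > 0).astype(float) for m in M_r]

X = sp.identity(N, format="csr")               # one-hot features for all Nu + Nv nodes
W_r = [rng.normal(scale=0.1, size=(N, H)) for _ in range(R)]  # per-rating weights (D = N here)
W = rng.normal(scale=0.1, size=(E, H))         # dense-layer weight, applied as H W^T in Eq. (7)

relu = lambda a: np.maximum(0.0, a)

# Eq. (9): symmetric block adjacency over the joint user+item node set.
M_tilde = [sp.bmat([[None, m], [m.T, None]], format="csr") for m in M_r]

# Left normalization D^{-1}, using node degrees of the full rating graph.
deg = np.asarray(sum(M_tilde).sum(axis=1)).ravel()
D_inv = sp.diags(1.0 / np.maximum(deg, 1.0))

# Eq. (8): H = sigma( sum_r D^{-1} M_tilde_r X W_r ), with sparse multiplications throughout.
Hmat = relu(sum((D_inv @ M_tilde[r] @ X) @ W_r[r] for r in range(R)))

# Eq. (7): dense layer, then split into user and item embeddings U and V.
Z = relu(Hmat @ W.T)
U, V = Z[:Nu], Z[Nu:]
print(U.shape, V.shape)   # (Nu, E), (Nv, E)
```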
2.6 Input feature representation and side information

Features containing information for each node, such as content information, can in principle be injected into the graph encoder directly at the input level (i.e. in the form of an input feature matrix X). However, when the content information does not contain enough information to distinguish different users (or items) and their interests, feeding the content information directly into the graph convolution layer leads to a severe bottleneck of information flow. In such cases, one can include side information in the form of user and item feature vectors x_i^f (for node i) via a separate processing channel directly into the dense hidden layer:

    u_i = \sigma\big( W h_i + W_2^f f_i \big) \quad \text{with} \quad f_i = \sigma\big( W_1^f x_i^f + b \big) ,    (10)

where W_1^f and W_2^f are trainable weight matrices and b is a bias. The weight matrices and bias vector are different for users and items. The input feature matrix X = [x_1^T, ..., x_N^T]^T containing the node features for the graph convolution layer is then chosen as an identity matrix, with a unique one-hot vector for every node in the graph. For the datasets considered in this paper, the user (item) content information is of limited size, and we thus opt to include it as side information while using Eq. (10).
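A minimal sketch of the side-information channel of Eq. (10) for a single user node follows; all dimensions and weight names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
H, E, Df = 6, 4, 10      # hidden size, embedding size, side-feature dimension
f_hidden = 8             # units of the separate side-information channel

h_i = rng.normal(size=H)         # intermediate graph-convolution output h_i for user i
x_f = rng.normal(size=Df)        # side-information feature vector x_i^f (e.g. age, gender)

W   = rng.normal(scale=0.1, size=(E, H))          # dense-layer weight W from Eq. (3)
W1f = rng.normal(scale=0.1, size=(f_hidden, Df))  # first side-information weight W_1^f
W2f = rng.normal(scale=0.1, size=(E, f_hidden))   # second side-information weight W_2^f
b   = np.zeros(f_hidden)                          # bias of the side-information channel

relu = lambda a: np.maximum(0.0, a)

# Eq. (10): side information enters through a separate channel added in the dense layer.
f_i = relu(W1f @ x_f + b)
u_i = relu(W @ h_i + W2f @ f_i)
print(u_i.shape)   # (E,)
```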
In [29], Strub et al. propose to include content information along similar lines, although in their case the proposed model is strictly user- or item-based, and thus only supports side information for either users or items.

Note that side information does not necessarily need to come in the form of per-node feature vectors, but can also be provided in the form of, e.g., graph-structured, natural language, or image data. In this case, the dense layer in (10) is replaced by an appropriate differentiable module, such as a recurrent neural network, a convolutional neural network, or another graph convolutional network.

2.7 Weight sharing

In the collaborative filtering setting with one-hot vectors as input, the columns of the weight matrices W_r play the role of latent factors for each separate node for one specific rating value r. These latent factors are passed onto connected user or item nodes through message passing. However, not all users and items necessarily have an equal number of ratings for each rating level. This results in certain columns of W_r being optimized significantly less frequently than others. Therefore, some form of weight sharing between the matrices W_r for different r is desirable to alleviate this optimization problem. Following [32], we therefore implement the following weight sharing setup:

    W_r = \sum_{s=1}^{r} T_s .    (11)

We will refer to this type of weight sharing as ordinal weight sharing, due to the increasing number of weight matrices included for higher rating levels.
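As a small illustration of Eq. (11), ordinal weight sharing can be realized as a cumulative sum of base matrices T_s (toy NumPy sketch with made-up shapes, not the experimental implementation):

```python
import numpy as np

rng = np.random.default_rng(5)
R, D, H = 5, 8, 6

# Base matrices T_s; W_r accumulates all T_s with s <= r (ordinal weight sharing, Eq. 11).
T = [rng.normal(scale=0.1, size=(D, H)) for _ in range(R)]
W_r = [sum(T[: r + 1]) for r in range(R)]   # W_1 = T_1, W_2 = T_1 + T_2, ...

# Higher rating levels therefore share all parameters of the lower levels.
assert np.allclose(W_r[2], T[0] + T[1] + T[2])
```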
As an effective means of regularization of the pairwise bilinear decoder, we resort to weight sharing in the form of a linear combination of a set of basis weight matrices P_s:

    Q_r = \sum_{s=1}^{n_b} a_{rs} P_s ,    (12)

with s ∈ {1, ..., n_b} and n_b the number of basis weight matrices. Here, a_{rs} are the learnable coefficients that determine the linear combination for each decoder weight matrix Q_r. Note that in order to avoid overfitting and to reduce the number of parameters, the number of basis weight matrices n_b should naturally be lower than the number of rating levels.
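Analogously, a toy sketch of the basis-matrix parameterization of Eq. (12), with hypothetical coefficients a_rs and basis matrices P_s:

```python
import numpy as np

rng = np.random.default_rng(6)
R, E, nb = 5, 4, 2    # rating levels, embedding size, number of basis matrices (nb < R)

P = rng.normal(scale=0.1, size=(nb, E, E))   # shared basis matrices P_s
a = rng.normal(size=(R, nb))                 # learnable coefficients a_rs

# Eq. (12): each decoder matrix Q_r is a linear combination of the shared basis matrices.
Q = np.einsum("rs,sij->rij", a, P)
assert np.allclose(Q[0], a[0, 0] * P[0] + a[0, 1] * P[1])
```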
3 Related work

Auto-encoders  User- or item-based auto-encoders [27, 32, 29] are a recent class of state-of-the-art collaborative filtering models that can be seen as a special case of our graph auto-encoder model, where only either user or item embeddings are considered in the encoder. AutoRec by Sedhain et al. [27] is the first such model, where the user's (or item's) partially observed rating vector is projected onto a latent space through an encoder layer, and reconstructed using a decoder layer with a mean squared reconstruction error loss.

The CF-NADE algorithm by Zheng et al. [32] can be considered a special case of the above auto-encoder architecture. In the user-based setting, messages are only passed from items to users, and in the item-based case the reverse holds. Note that, in contrast to our model, unrated items are assigned a default rating of 3 in the encoder, thereby creating a fully connected interaction graph. CF-NADE imposes a random ordering on nodes, and splits incoming messages into two sets via a random cut, only one of which is kept. This model can therefore be seen as a denoising auto-encoder, where part of the input space is dropped out at random in every iteration.

Factorization models  Many of the most popular collaborative filtering algorithms fall into the class of matrix factorization (MF) models. Methods of this sort assume the rating matrix to be well approximated by a low-rank matrix: M ≈ U V^T, with U ∈ R^{N_u × k} and V ∈ R^{N_v × k}, where k ≪ N_u, N_v. The rows of U and V can be seen as latent feature representations of users and items, encoding their interests through their rating pattern. Probabilistic matrix factorization (PMF) by Salakhutdinov et al. [20] assumes that the ratings contained in M are independent stochastic variables with Gaussian noise. Optimization of the maximum likelihood then leads one to minimize the mean squared error between the observed entries in M and the reconstructed ratings in U V^T. BiasedMF by Koren et al. [16] improves upon PMF by incorporating user- and item-specific biases, as well as a global bias. Neural network matrix factorization (NNMF) [7] extends the MF approach by passing the latent user and item features through a feed-forward neural network. Local low-rank matrix approximation by Lee et al. [17] introduces the idea of reconstructing rating matrix entries using different (entry-dependent) combinations of low-rank approximations.

Matrix completion with side information  In matrix completion (MC) [3], the objective is to approximate the rating matrix with a low-rank rating matrix. Rank minimization, however, is an intractable problem, and Candès & Recht [3] replaced the rank minimization with a minimization of the nuclear norm (the sum of the singular values of a matrix), turning the objective function into a tractable convex one.
Inductive matrix completion (IMC) by Jain & Dhillon [11] and Xu et al. [31] incorporates content information of users and items in feature vectors and approximates the observed elements of the rating matrix as M_ij = x_i^T U V^T y_j, with x_i and y_j representing the feature vectors of user i and item j, respectively.

The geometric matrix completion (GMC) model proposed by Kalofolias et al. [12] introduces a regularization of the MC model by adding side information in the form of user and item graphs. In [25], a more efficient alternating least squares optimization method (GRALS) is introduced for the graph-regularized matrix completion problem. Most recently, Monti et al. [22] suggested incorporating graph-based side information in matrix completion via the use of convolutional neural networks on graphs, combined with a recurrent neural network to model the dynamic rating generation process. Their work differs from ours in that we model the rating graph directly, using a graph convolutional encoder/decoder approach that predicts unseen ratings in a single, non-iterative step.

MovieLens 100K  For this task, we compare against matrix completion baselines that make use of side information in the form of user/item features. We report performance on the canonical u1.base/u1.test train/test split. Hyperparameters are optimized on an 80/20 train/validation split of the original training set. Side information is present both for users (e.g. age, gender, and occupation) and movies (genres). Following Rao et al. [25], we map the additional information onto feature vectors for users and movies, and compare the performance of our model with (GC-MC+Feat) and without (GC-MC) the inclusion of these features. Note that GMC [12], GRALS [25] and sRGCNN [22] represent user/item features via a k-nearest-neighbor graph. We use stacking as an accumulation function in the graph convolution layer in Eq. (2), set dropout equal to 0.7, and use left normalization. GC-MC+Feat uses 10 hidden units for the dense side information layer (with ReLU activation), as described in Eq. (10). We train both models for 1,000 full-batch epochs. We report RMSE scores averaged over 5 runs with random initializations. Results are summarized in Table 2.
Table 1: Number of users, items and ratings for each of the MovieLens datasets used in our experiments. We
further indicate rating density and rating levels.
Figure 3: Cold-start analysis for ML-100K. Test set RMSE (average over 5 runs with random initialization) for various settings, where only a small number of ratings N_r is kept for a certain number of cold-start users N_c during training. Standard error is below 0.001 and therefore not shown. Dashed and solid lines denote experiments without and with side information, respectively.

N_c ∈ {0, 50, 100, 150}, both with and without using user/item features as side information (see Figure 3). Hyperparameters and test set are chosen as before, i.e. we report RMSE on the complete canonical test set split. The benefit of incorporating side information, such as user and item features, is especially pronounced in the presence of many users with only a single rating.

Discussion  On the ML-100K task with side information, our model outperforms related methods by a significant margin. Remarkably, this is even the case without the use of side information. Most related to our method is sRGCNN by Monti et al. [22], which uses graph convolutions on the nearest-neighbor graphs of users and items, and learns representations in an iterative manner using recurrent neural networks. Our results demonstrate that a direct estimation of the rating matrix from learned user/item representations using a simple decoder model can be more effective, while being computationally more efficient.

Conclusions

In this work, we have introduced graph convolutional matrix completion (GC-MC): a graph auto-encoder framework for the matrix completion task in recommender systems. The encoder contains a graph convolution layer that constructs user and item embeddings through message passing on the bipartite user-item interaction graph. Combined with a bilinear decoder, new ratings are predicted in the form of labeled edges.

The graph auto-encoder framework naturally generalizes to include side information for both users and items. In this setting, our proposed model outperforms recent related methods by a large margin, as demonstrated on a number of benchmark datasets with feature- and graph-based side information. We further show that our model can be trained on larger-scale datasets through stochastic mini-batching. In this setting, our model achieves results that are competitive with recent state-of-the-art collaborative filtering methods.

In future work, we wish to extend this model to large-scale multi-modal data (comprised of text, images, and other graph-based information), such as present in many realistic recommendation platforms. In such a setting, the GC-MC model can be combined with recurrent (for text) or convolutional (for images) neural networks. To address scalability, it is necessary to develop efficient approximate schemes, such as subsampling local neighborhoods [10]. Finally, attention mechanisms [1] provide a promising avenue for extending this class of models.

Acknowledgments

We would like to thank Jakub Tomczak, Christos Louizos, Karen Ullrich and Peter Bloem for helpful discussions and comments. This project is supported by the SAP Innovation Center Network.
References

[7] Gintare Karolina Dziugaite and Daniel M. Roy. Neural network matrix factorization. arXiv preprint arXiv:1511.06443, 2015.

[8] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. ICML, 2017.

[9] David Goldberg, David Nichols, Brian M. Oki, and Douglas Terry. Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12):61-70, 1992.

[10] William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. arXiv preprint arXiv:1706.02216, 2017.

[11] Prateek Jain and Inderjit S. Dhillon. Provable inductive matrix completion. arXiv preprint arXiv:1306.0626, 2013.

[12] Vassilis Kalofolias, Xavier Bresson, Michael Bronstein, and Pierre Vandergheynst. Matrix completion on graphs. arXiv preprint arXiv:1408.1717, 2014.

[13] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[14] Thomas N. Kipf and Max Welling. Variational graph auto-encoders. NIPS Bayesian Deep Learning Workshop, 2016.

[15] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. ICLR, 2017.

[16] Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30-37, August 2009.

[17] Joonseok Lee, Seungyeon Kim, Guy Lebanon, and Yoram Singer. Local low-rank matrix approximation. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML), volume 28 of Proceedings of Machine Learning Research, pages 82-90, Atlanta, Georgia, USA, 17-19 Jun 2013. PMLR.

[18] Xin Li and Hsinchun Chen. Recommendation as link prediction in bipartite graphs: A graph kernel-based machine learning approach. Decision Support Systems, 54(2):880-890, 2013.

[19] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. ICLR, 2016.

[24] Michael J. Pazzani and Daniel Billsus. Content-based recommendation systems. In The Adaptive Web, pages 325-341. Springer, 2007.

[25] Nikhil Rao, Hsiang-Fu Yu, Pradeep K. Ravikumar, and Inderjit S. Dhillon. Collaborative filtering with graph information: Consistency and scalable methods. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 2107-2115. Curran Associates, Inc., 2015.

[26] Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pages 791-798. ACM, 2007.

[27] Suvash Sedhain, Aditya Krishna Menon, Scott Sanner, and Lexing Xie. AutoRec: Autoencoders meet collaborative filtering. In Proceedings of the 24th International Conference on World Wide Web, pages 111-112. ACM, 2015.

[28] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.

[29] Florian Strub, Romaric Gaudel, and Jérémie Mary. Hybrid recommender system based on autoencoders. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, DLRS 2016, pages 11-16, New York, NY, USA, 2016. ACM.

[30] Fei Tian, Bin Gao, Qing Cui, Enhong Chen, and Tie-Yan Liu. Learning deep representations for graph clustering. In AAAI, pages 1293-1299, 2014.

[31] Miao Xu, Rong Jin, and Zhi-Hua Zhou. Speedup matrix completion with side information: Application to multi-label learning. In Advances in Neural Information Processing Systems, pages 2301-2309, 2013.

[32] Yin Zheng, Bangsheng Tang, Wenkui Ding, and Hanning Zhou. A neural autoregressive approach to collaborative filtering. In Proceedings of the 33rd International Conference on Machine Learning, pages 764-773, 2016.