General Graph Random Features
Published as a conference paper at ICLR 2024
We assume that the sum above converges for all W under consideration, which can be ensured with a regulariser W → σW, σ ∈ R+. Without loss of generality, we also assume that α is normalised such that α0 = 1. The matrix Kα(W) can be associated with a graph function Kα^G : N × N → R mapping from a pair of graph nodes to a real number. Note that if G is an undirected graph then Kα(W) automatically inherits the symmetry of W. In this case, it follows from Weyl’s perturbation inequality (Bai et al., 2000) that Kα(W) is positive semidefinite for any given α provided the spectral radius ρ(W) := max_{λ∈Λ(W)} |λ| is sufficiently small (with Λ(W) the set of eigenvalues of W). This can again be ensured by multiplying the weight matrix W by a regulariser σ ∈ R+. It then follows that Kα(W) can be considered the Gram matrix of a graph kernel function Kα^G.
With suitably chosen α = (αk)_{k=0}^∞, the class described by Eq. 2 includes many popular examples of graph node kernels in the literature (Smola and Kondor, 2003; Chapelle et al., 2002). They measure connectivity between nodes and are typically functions of the graph Laplacian matrix, defined by L := I − W̃ with W̃ := [wij / √(di dj)]_{i,j=1}^N. Here, di := Σ_j wij is the weighted degree of node i, such that W̃ is the normalised weighted adjacency matrix. For reference, Table 1 gives the kernel definitions and normalised coefficients αk (corresponding to powers of W̃) to be considered later in the manuscript. In practice, factors in αk equal to a quantity raised to the power of k are absorbed into the normalisation of W̃.
Name                      Form                    αk
d-regularised Laplacian   (IN + σ²L)^{−d}         binom(d+k−1, k) (1 + σ^{−2})^{−k}
p-step random walk        (αIN − L)^p, α ≥ 2      binom(p, k) (α − 1)^{−k}
Diffusion                 exp(−σ²L/2)             (1/k!) (σ²/2)^k
Inverse Cosine            cos(Lπ/4)               (1/k!) (π/4)^k · { (−1)^{k/2} if k even, (−1)^{(k−1)/2} if k odd }
Table 1: Different graph functions/kernels Kα^G : N × N → R. The exp and cos mappings are defined via Taylor series expansions rather than element-wise, e.g. exp(M) := lim_{n→∞} (IN + M/n)^n and cos(M) := Re(exp(iM)). σ and α are regularisers. Note that the diffusion kernel is sometimes instead defined by exp(σ²(IN − L)), but these forms are equivalent up to normalisation. Also note that the p-step random walk kernel is closely related to the graph Matérn kernel (Borovitskiy et al., 2021).
The chief goal of this work is to construct a random feature map ϕ(i) : N → R^l, with l ∈ N, that provides unbiased approximation of Kα(W) as in Eq. 1. To do so, we consider the following algorithm.
Algorithm 1 Constructing a random feature vector ϕf(i) ∈ R^N to approximate Kα(W)
Input: weighted adjacency matrix W ∈ R^{N×N}, vector of unweighted node degrees (no. neighbours) d ∈ R^N, modulation function f : (N ∪ {0}) → R, termination probability phalt ∈ (0, 1), node i ∈ N, number of random walks to sample m ∈ N.
Output: random feature vector ϕf(i) ∈ R^N
1: initialise: ϕf(i) ← 0
2: for w = 1, ..., m do
3:   initialise: load ← 1
4:   initialise: current_node ← i
5:   initialise: terminated ← False
6:   initialise: walk_length ← 0
7:   while terminated = False do
8:     ϕf(i)[current_node] ← ϕf(i)[current_node] + load × f(walk_length)
9:     walk_length ← walk_length + 1
10:    new_node ← Unif[N(current_node)]                                  ▷ assign to one of the neighbours
11:    load ← load × d[current_node] / (1 − phalt) × W[current_node, new_node]   ▷ update load
12:    current_node ← new_node
13:    terminated ← (t ∼ Unif(0, 1) < phalt)                             ▷ draw RV t to decide on termination
14:  end while
15: end for
16: normalise: ϕf(i) ← ϕf(i)/m
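For concreteness, Alg. 1 can be sketched in a few lines of NumPy. This is a minimal reference implementation under our own naming conventions, not the authors' code; for simplicity we recompute the unweighted degrees d from W rather than passing them in.

```python
import numpy as np

def g_grf_feature(W, f, p_halt, i, m, rng):
    """Sample a g-GRF feature vector phi_f(i) following Alg. 1.

    W: (N, N) weighted adjacency matrix; f: modulation function on walk
    lengths; p_halt: termination probability; i: start node; m: number
    of walks; rng: np.random.Generator.
    """
    N = W.shape[0]
    neighbours = [np.flatnonzero(W[v]) for v in range(N)]  # unweighted adjacency
    d = np.array([len(nb) for nb in neighbours])           # unweighted degrees
    phi = np.zeros(N)
    for _ in range(m):
        load, cur, length = 1.0, i, 0
        while True:
            phi[cur] += load * f(length)                   # deposit modulated load
            length += 1
            new = rng.choice(neighbours[cur])              # uniform neighbour
            load *= d[cur] / (1.0 - p_halt) * W[cur, new]  # importance weight
            cur = new
            if rng.random() < p_halt:                      # geometric termination
                break
    return phi / m
```

With the ‘lazy’ modulation function f(k) = I(k = 0), every walk deposits its unit load at the start node, so ϕf(i) is exactly the one-hot vector e_i; pairing a lazy walker with f1(k) = αk then recovers [Kα]ij in expectation, as described below Theorem 2.1.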
Figure 1: Schematic for a random walk on a graph (solid red) and an accompanying modulation function f (dashed blue) used to approximate an arbitrary graph node function K^G.
This is identical to the algorithm presented by Choromanski (2023) for constructing features to approximate the 2-regularised Laplacian kernel, apart from the presence of the extra modulation function f : (N ∪ {0}) → R in line 8, which upweights or downweights contributions from walks depending on their length (see Fig. 1). We refer to ϕf as general graph random features (g-GRFs), where the subscript f identifies the modulation function. Crucially, the time complexity of Alg. 1 is subquadratic in the number of nodes N, in contrast to exact methods which are O(N³).
We now state the following central result, proved in App. A.2.
Theorem 2.1 (Unbiased approximation of Kα via convolutions). For two modulation functions f1, f2 : (N ∪ {0}) → R, g-GRFs (ϕf1(i))_{i=1}^N, (ϕf2(i))_{i=1}^N constructed according to Alg. 1 give unbiased approximation of Kα,

    [Kα]ij = E[ϕf1(i)⊤ ϕf2(j)],    (3)

for kernels with an arbitrary Taylor expansion α = (αk)_{k=0}^∞, provided that α = f1 ∗ f2. Here, ∗ is the discrete convolution of the modulation functions f1, f2; that is, for all k ∈ (N ∪ {0}),

    Σ_{p=0}^{k} f1(k − p) f2(p) = αk.    (4)
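The constraint in Eq. 4 can be inverted numerically: given any f2 with f2(0) ≠ 0, forward substitution solves for f1. A sketch (function names are ours):

```python
from math import factorial

def deconvolve(alpha, f2):
    """Solve sum_{p<=k} f1(k-p) f2(p) = alpha_k for f1, given f2(0) != 0."""
    assert f2[0] != 0
    f1 = []
    for k in range(len(alpha)):
        # residual collects the terms of the convolution already fixed
        residual = sum(f1[k - p] * f2[p] for p in range(1, k + 1))
        f1.append((alpha[k] - residual) / f2[0])
    return f1

def convolve(f1, f2):
    """Discrete convolution, truncated to len(f1) coefficients."""
    return [sum(f1[k - p] * f2[p] for p in range(k + 1)) for k in range(len(f1))]
```

For the ‘lazy’ choice f2 = (1, 0, 0, ...) this returns f1(k) = αk, recovering the trivial solution discussed in the text.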
Clearly the class of pairs of modulation functions f1 , f2 that satisfy Eq. 4 is highly degen-
erate. Indeed, it is possible to solve for f1 given any f2 and α provided f2 (0) ̸= 0. For
instance, a trivial solution is given by: f1 (i) = αi , f2 (i) = I(i = 0) with I(·) the indicator
function. In this case, the walkers corresponding to f2 are ‘lazy’, depositing all their load
at the node at which they begin. Contributions to the estimator ϕf1 (i)⊤ ϕf2 (j) only come
from walkers initialised at i traversing all the way to j rather than two walkers both passing
through an intermediate node. Also of great interest is the case of symmetric modulation
functions f1 = f2 , where now intersections do contribute. In this case, the following is true
(proved in App. A.3).
Theorem 2.2 (Computing symmetric modulation functions). Supposing f1 = f2 = f, Eq. 4 is solved by a function f which is unique (up to a sign) and is given by

    f(i) = ± Σ_{n=0}^{i} binom(1/2, n) Σ_{k1+2k2+3k3+...=i, k1+k2+k3+...=n} binom(n; k1, k2, k3, ...) α1^{k1} α2^{k2} α3^{k3} ....    (5)

In practice f can instead be computed with the equivalent iterative scheme, f(0) = ±√α0 = ±1 and

    f(i + 1) = (αi+1 − Σ_{p=0}^{i−1} f(i − p) f(p + 1)) / (2 f(0)),    (6)

which gives f(i + 1) from αi+1 and (f(p))_{p=0}^{i} (see App. A.3).
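The iterative scheme is a few lines of code; a sketch taking the positive root and our normalisation α0 = 1:

```python
from math import factorial

def symmetric_modulation(alpha):
    """Compute f with f * f = alpha via the Eq. 6 recursion (positive root)."""
    f = [alpha[0] ** 0.5]  # f(0) = +sqrt(alpha_0) = 1 under the normalisation
    for i in range(len(alpha) - 1):
        residual = sum(f[i - p] * f[p + 1] for p in range(i))
        f.append((alpha[i + 1] - residual) / (2.0 * f[0]))
    return f
```

For the diffusion kernel with the σ-dependence absorbed into W̃, αk = 1/k!, and the recursion reproduces the closed form f(i) = 1/(2^i i!), consistent with √exp(W) = exp(W/2).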
For symmetric modulation functions, the random features ϕf1(i) and ϕf2(i) are identical apart from the particular sample of random walks used to construct them. They must not share the same sample, or estimates of the diagonal kernel elements [Kα]ii will be biased.
Computational cost: Note that when running Alg. 1 one only needs to evaluate the
modulation functions f1,2 (i) up to the length of the longest walk one samples. A batch
of size b, (f1,2(i))_{i=1}^b, can be pre-computed in time O(b²) and reused for random features corresponding to different nodes and even different graphs. Further values of f can be computed at runtime if b is too small, and also reused for later computations. Moreover, the minimum length b required to ensure that all m walks are shorter than b with high probability (Pr(∩_{i=1}^m {len(ωi) ≤ b}) > 1 − δ, δ ≪ 1) scales only logarithmically with m (see App. A.1). This
means that, despite estimating a much more general family of graph functions, g-GRFs are
essentially no more expensive than the original GRF algorithm. Moreover, any techniques
used for dimensionality reduction of regular GRFs (e.g. applying the Johnson-Lindenstrauss
transform (Dasgupta et al., 2010) or using ‘anchor points’ (Choromanski, 2023)) can also
be used with g-GRFs, providing further efficiency gains.
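The logarithmic scaling of b can be checked directly from the closed form derived in App. A.1, b = log(1 − (1 − δ)^{1/m}) / log(1 − p_halt). A numeric sketch (parameter values arbitrary):

```python
from math import log

def min_batch_size(m, p_halt, delta):
    """Smallest b such that all m geometric-length walks are shorter
    than b with probability at least 1 - delta (App. A.1, Eq. 18)."""
    return log(1.0 - (1.0 - delta) ** (1.0 / m)) / log(1.0 - p_halt)
```

Multiplying m by 100 adds roughly log(100)/(−log(1 − p_halt)) to b, independent of m, which is the logarithmic scaling claimed above.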
Generating functions: Inserting the constraint for unbiasedness in Eq. 4 back into the definition of Kα(W), we immediately have that

    Kα(W) = Kf1(W) Kf2(W),    (7)

where Kf1(W) := Σ_{i=0}^∞ f1(i) W^i is the generating function corresponding to the sequence (f1(i))_{i=0}^∞. This is natural because the (discrete) Fourier transform of a (discrete) convolution returns the product of the (discrete) Fourier transforms of the respective functions. In the symmetric case f1 = f2, it follows that

    Kf(W) = ± (Kα(W))^{1/2}.    (8)
If the RHS has a simple Taylor expansion (e.g. Kα(W) = exp(W) so that Kf(W) = exp(W/2)), this enables us to obtain f without recourse to the conditional sum in Eq. 5 or the iterative expression in Eq. 6. This is the case for many popular graph kernels; we provide some prominent examples in the table below. A notable exception is the inverse cosine kernel.

Name                      f(i)
d-regularised Laplacian   (d − 2 + 2i)!! / ((2i)!! (d − 2)!!)
p-step random walk        binom(p/2, i)
Diffusion                 1 / (2^i i!)

As an interesting corollary, by considering the diffusion kernel we have also proved that Σ_{p=0}^{k} (1/(2^p p!)) (1/(2^{k−p} (k−p)!)) = 1/k!.
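These closed forms can be sanity-checked numerically against the αk of Table 1 (with the k-th-power factors absorbed into W̃, as described in Sec. 2). A sketch:

```python
from math import factorial

def gen_binom(x, n):
    """Generalised binomial coefficient binom(x, n) for real x."""
    out = 1.0
    for t in range(n):
        out *= (x - t) / (t + 1)
    return out

def self_convolve(f, K):
    """First K coefficients of the discrete self-convolution f * f."""
    return [sum(f(k - p) * f(p) for p in range(k + 1)) for k in range(K)]

# Diffusion: f(i) = 1/(2^i i!) should convolve to alpha_k = 1/k!
diff_conv = self_convolve(lambda i: 1.0 / (2 ** i * factorial(i)), 8)

# p-step random walk: f(i) = binom(p/2, i) should convolve to binom(p, k)
p = 3
rw_conv = self_convolve(lambda i: gen_binom(p / 2, i), 8)
```

The random-walk case is just the Vandermonde identity Σ_j binom(p/2, k − j) binom(p/2, j) = binom(p, k), which holds for non-integer upper arguments too.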
Theorem 2.3 (Empirical Rademacher complexity bound). For a fixed sample S = (xi)_{i=1}^m, the empirical Rademacher complexity R̂(H) is bounded by

    R̂(H) ≤ √( (1/m) Σ_{i=0}^∞ αi^(M) ρ(W)^i ),    (10)

where ρ(W) is the spectral radius of the weighted adjacency matrix W.
3 Experiments
Here we test the empirical performance of g-GRFs, both for approximating fixed kernels
(Secs 3.1-3.3) and with learnable neural modulation functions (Secs 3.4-3.5).
We begin by confirming that g-GRFs do indeed give unbiased estimates of the graph kernels listed in Table 1, taking regularisers σ = 0.25 and α = 20 with phalt = 0.1. We use symmetric modulation functions f, computed with the closed forms where available and with the iterative scheme in Eq. 6 otherwise. Fig. 2 plots the relative Frobenius norm error between the true kernels K and their g-GRF approximations K̂ (that is, ∥K − K̂∥F / ∥K∥F) against the number of random walkers m. We consider 8 different graphs: a small random Erdős-Rényi graph, a larger Erdős-Rényi graph, a binary tree, a d-regular graph and 4 real-world examples (karate, dolphins, football and eurosis) (Ivashkin, 2023). They vary substantially in size. For every graph and for all kernels, the quality of the estimate improves as m grows, with the error becoming very small even for a modest number of walkers.
[Figure 2 panels: ER (N = 20, p = 0.2), ER (N = 100, p = 0.04), Binary tree (N = 127), d-regular (N = 100, d = 10), Karate (N = 34), Dolphins (N = 62), Football (N = 115), Eurosis (N = 1272). Each panel plots Frobenius norm error against the number of random walks for the 1-reg Laplacian, 2-reg Laplacian, 2-step RW, diffusion and cosine kernels.]
Figure 2: Unbiased estimation of popular kernels on different graphs using g-GRFs. The
approximation error (y-axis) improves with the number of walkers (x-axis). We repeat 10
times; one standard deviation of the mean error is shaded.
…under the ODE for t = 1 with n = 10 discretisation timesteps and P uniform, approximating exp(W(t − τj)) with different numbers of walkers m. As m grows, the quality of approximation improves and the (normalised) error on the final state, ∥x̂(1) − x(1)∥2 / ∥x(1)∥2, drops for every graph. We take 100 repeats for statistics and plot the results in Fig. 3. One standard deviation of the mean error is shaded. phalt = 0.1 and the regulariser is given by σ = 1.

Figure 3: ODE simulation error decreases as the number of walkers grows.
Following the discussion in Sec. 2.1, we now replace fixed f with a neural modulation function f(N) parameterised by a simple neural network with 1 hidden layer of size 1:

    f(N)(x) = σsoftplus(w2 σReLU(w1 x + b1) + b2),    (16)
where w1 , b1 , w2 , b2 ∈ R and σReLU and σsoftplus are the ReLU and softplus activation func-
tions, respectively. Bigger, more expressive architectures (including allowing f (N ) to take
negative values) can be used but this is found to be sufficient for our purposes.
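Eq. 16 is small enough to write out directly; a sketch with arbitrary placeholder weights (the trained values are not reproduced here):

```python
from math import exp, log1p

def neural_modulation(w1, b1, w2, b2):
    """f(N)(x) = softplus(w2 * relu(w1 * x + b1) + b2), as in Eq. 16."""
    def f(x):
        hidden = max(w1 * x + b1, 0.0)        # ReLU hidden layer of size 1
        return log1p(exp(w2 * hidden + b2))   # softplus output keeps f > 0
    return f
```

Because of the softplus output, f(N)(x) > 0 everywhere, so this architecture cannot represent sign-alternating modulation functions (e.g. that of the inverse cosine kernel) unless it is extended as noted above.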
We define our loss function to be the Frobenius norm error between a target Gram matrix
and our g-GRF-approximated Gram matrix on the small Erdős-Rényi graph (N = 20) with
m = 16 walks. For the target, we choose the 2-regularised Laplacian kernel. We train
symmetric f1(N) = f2(N), but provide a brief discussion of the asymmetric case (including its
respective strengths and weaknesses) in App. A.5. On this graph, we minimise the loss with
the Adam optimiser and a decaying learning rate (LR = 0.01, γ = 0.975, 1000 epochs). We
make the following striking observation: f (N ) does not generically converge to the unique
unbiased (symmetric) modulation function implied by α, but instead to some different f
that though biased gives a smaller mean squared error (MSE). This is possible because by
downweighting long walks the learned f (N ) gives estimators with a smaller variance, which
is sufficient to suppress the MSE on the kernel matrix elements even though it no longer
gives the target value in expectation. We then fix f (N ) and use it for kernel approximation
on the remaining graphs. The learned, biased f (N ) still provides better kernel estimates,
including for graphs with very different topologies and a much greater number of nodes:
eurosis is bigger by a factor of over 60. See Table 3 for the results. phalt = 0.5 and σ = 0.8.
Naturally, the learned f(N) depends on the number of random walks m; as m grows,
the variance on the kernel approximation drops so it is intuitive that the learned f (N ) will
approach the unbiased f . Fig. 4 empirically confirms this is the case, showing the learned
f (N ) for different numbers of walkers. The line labelled ∞ is the unbiased modulation
function, which for the 2-regularised Laplacian kernel is constant.
Table 3 (excerpt): kernel approximation error with the unbiased modulation function f and the learned, biased f(N). Brackets give one standard deviation on the final digit.

Graph       N      Unbiased f   Learned f(N)
d-regular   100    0.0490(2)    0.0434(2)
karate      34     0.0492(6)    0.0439(6)
dolphins    62     0.0505(5)    0.0449(5)
football    115    0.0520(2)    0.0459(2)
eurosis     1272   0.0551(2)    0.0484(2)

[Figure 4: learned modulation function f(N)(i) against i for m ∈ {64, 128, 256, 512} random walks, alongside the line labelled ∞, the unbiased modulation function.]
These learned modulation functions might guide the more principled construction of biased,
low-MSE GRFs in the future. An analogue in Euclidean space is provided by structured orthogonal random features (SORFs), which replace a random Gaussian matrix with an HD-product to estimate the Gaussian kernel (Choromanski et al., 2017; Yu et al., 2016). This likewise improves kernel approximation quality despite breaking estimator unbiasedness.
As suggested in Sec. 2.1, it is also possible to train the neural modulation function f (N )
directly using performance on a downstream task, performing implicit kernel learning. We
have argued that this is much more scalable than optimising Kα directly.
In this spirit, we now address the problem of kernel regression on triangular mesh graphs, as previously considered by Reid et al. (2023). For graphs in this dataset (Dawson-Haggerty, 2023), every node is associated with a normal vector v(i) ∈ R³ equal to the mean of the normal vectors of its 3 surrounding faces. The task is to predict the directions of missing vectors (a random 5% split) from the remainder. Our (unnormalised) predictions are given by v̂(i) := Σ_j K̂(N)(i, j) v(j), where K̂(N) is a kernel estimate constructed using g-GRFs with a neural modulation function f(N) (see Eq. 3). The angular prediction error is 1 − cos θi, with θi the angle between the true normal v(i) and the approximate normal v̂(i), averaged over the missing vectors. We train a symmetric pair f(N) using this angular prediction error on the small cylinder graph (N = 210) as the loss function. Then we freeze f(N) and compute the angular prediction error for other, larger graphs. Fig. 5 shows the learned f(N) as well as some other modulation functions corresponding to popular fixed kernels. Note also that learning f(N) already includes (but is not limited to) optimising the lengthscale of a given kernel: taking W̃ → βW̃ is identical to f(i) → f(i)β^i ∀ i.

Figure 5: Fixed and learned modulation functions for the kernel regression task (f(i) against i for the learned f(N) and the 1-reg Laplacian, 2-reg Laplacian and diffusion kernels).

The prediction errors are highly correlated between the different modulation functions for a
given random draw of walks; ensembles which ‘explore’ the graph poorly, terminating quickly
or repeating edges, will give worse g-GRF estimators for every f . For this reason, we com-
pute the prediction errors as the average normalised difference compared to the learned
kernel result. Table 4 reports the results. Crucially, this difference is found to be positive
for every graph and every fixed kernel, meaning the learned kernel always performs best.
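The prediction rule is a kernel-weighted average of the known normals, scored by the angular error. A small sketch with a synthetic kernel estimate (all data here is made up for illustration; in the paper K̂(N) comes from g-GRFs):

```python
import numpy as np

def predict_normals(K_hat, V, missing):
    """Unnormalised predictions v_hat(i) = sum_j K_hat[i, j] v(j),
    summing only over nodes whose normals are known."""
    known = [j for j in range(V.shape[0]) if j not in set(missing)]
    return {i: K_hat[i, known] @ V[known] for i in missing}

def angular_error(v_true, v_hat):
    """1 - cos(theta) between the true and predicted normals."""
    cos = v_true @ v_hat / (np.linalg.norm(v_true) * np.linalg.norm(v_hat))
    return 1.0 - cos
```

Note that only the direction of v̂(i) matters for the loss, which is why the predictions can be left unnormalised.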
Table 4: Normalised difference in angular prediction error compared to the learned
kernel defined by f (N ) (trained on the cylinder graph). All entries are positive since
the learned kernel always performs best. We take 100 repeats (but only 10 for the very
large cycloidal graph). The brackets give one standard deviation on the final digit.
It is remarkable that the learned f (N ) gives the smallest error for all the graphs even
though it was just trained on cylinder, the smallest one. We have implicitly learned a
good kernel for this task which generalises well across topologies. It is also notable that the diffusion kernel performs only slightly worse. This is to be expected because its modulation function is similar to the learned one (see Fig. 5), so they encode very similar kernels, though this will not always be the case depending on the task at hand.
4 Conclusion
We have introduced ‘general graph random features’ (g-GRFs), a novel random walk-based
algorithm for time-efficient estimation of arbitrary functions of a weighted adjacency matrix.
The mechanism is conceptually simple and trivially distributed across machines, unlocking
kernel-based machine learning on very large graphs. By parameterising one component of the random features with a simple neural network, we can further suppress the mean squared error of estimators and perform scalable implicit kernel learning.
Ethics: Our work is foundational. There are no direct ethical concerns that we can see,
though of course increases in scalability afforded by graph random features might amplify
risks of graph-based machine learning, from bad actors or as unintended consequences.
Reproducibility: To foster reproducibility, we clearly state the central algorithm in Alg. 1.
Source code is available at https://github.com/isaac-reid/general_graph_random_features.
All theoretical results are accompanied by proofs in Appendices A.1-A.4, where any assump-
tions are made clear. The datasets we use correspond to standard graphs and are freely
available online. We link suitable repositories in every instance. Except where prohibitively
computationally expensive, results are reported with uncertainties to help comparison.
EB initially proposed using a modulation function to generalise GRFs to estimate the dif-
fusion kernel and derived its mathematical expression. IR and KC then developed the full
g-GRFs algorithm for general functions of a weighted adjacency matrix, proving all theoret-
ical results, running the experiments and preparing the manuscript. AW provided helpful
discussions and advice.
IR acknowledges support from a Trinity College External Studentship. AW acknowledges
support from a Turing AI fellowship under grant EP/V025279/1 and the Leverhulme Trust
via CFI.
We thank Kenza Tazi and Austin Tripp for their careful readings of the manuscript. Richard
Turner provided valuable suggestions and support throughout the project.
References
Zhaojun Bai, James Demmel, Jack Dongarra, Axel Ruhe, and Henk van der Vorst. Templates
for the solution of algebraic eigenvalue problems: a practical guide. SIAM, 2000. URL
https://www.cs.ucdavis.edu/~bai/ET/contents.html.
Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk
bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482,
2002. URL https://dl.acm.org/doi/pdf/10.5555/944919.944944.
Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J
Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinfor-
matics, 21(suppl_1):i47–i56, 2005. URL https://doi.org/10.1093/bioinformatics/
bti1007.
Stéphane Canu and Alexander J. Smola. Kernel methods and the exponential fam-
ily. Neurocomputing, 69(7-9):714–720, 2006. doi: 10.1016/j.neucom.2005.12.009. URL
https://doi.org/10.1016/j.neucom.2005.12.009.
Olivier Chapelle, Jason Weston, and Bernhard Schölkopf. Cluster kernels for semi-supervised
learning. Advances in neural information processing systems, 15, 2002. URL https:
//dl.acm.org/doi/10.5555/2968618.2968693.
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane,
Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al.
Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020. URL
https://doi.org/10.48550/arXiv.2009.14794.
Krzysztof M Choromanski, Mark Rowland, and Adrian Weller. The unreasonable effec-
tiveness of structured random orthogonal embeddings. Advances in neural information
processing systems, 30, 2017. URL https://doi.org/10.48550/arXiv.1703.00864.
Krzysztof Marcin Choromanski. Taming graph kernels with random features. In In-
ternational Conference on Machine Learning, pages 5964–5977. PMLR, 2023. URL
https://doi.org/10.48550/arXiv.2305.00156.
Fan R. K. Chung and Shing-Tung Yau. Coverings, heat kernels and spanning trees. Electron.
J. Comb., 6, 1999. doi: 10.37236/1444. URL https://doi.org/10.37236/1444.
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for
learning kernels. In Proceedings of the 27th Annual International Conference on Machine
Learning (ICML 2010), 2010. URL http://www.cs.nyu.edu/~mohri/pub/lk.pdf.
Keenan Crane, Clarisse Weischedel, and Max Wardetzky. The heat method for distance
computation. Communications of the ACM, 60(11):90–99, 2017. URL https://dl.acm.
org/doi/10.1145/3131280.
Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson-Lindenstrauss transform. In Leonard J. Schulman, editor, Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC 2010, Cambridge, Massachusetts, USA, 5-8 June 2010, pages 341–350. ACM, 2010. doi: 10.1145/1806689.1806737. URL https://doi.org/10.1145/1806689.1806737.
Michael Dawson-Haggerty. Trimesh repository, 2023. URL https://github.com/mikedh/
trimesh.
Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering
and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 551–556, 2004. URL https://dl.acm.
org/doi/10.1145/1014052.1014118.
Michel X. Goemans and David P. Williamson. Approximation algorithms for MAX-3-CUT and other problems via complex semidefinite programming. J. Comput. Syst. Sci., 68(2):442–470, 2004. doi: 10.1016/j.jcss.2003.07.012. URL https://doi.org/10.1016/j.jcss.2003.07.012.
Vladimir Ivashkin. Community graphs repository, 2023. URL https://github.com/
vlivashkin/community-graphs.
William B Johnson. Extensions of lipschitz mappings into a hilbert space. Contemp. Math.,
26:189–206, 1984. URL http://stanford.edu/class/cs114/readings/JL-Johnson.
pdf.
Kyle Kloster and David F Gleich. Heat kernel based community detection. In Proceedings
of the 20th ACM SIGKDD international conference on Knowledge discovery and data
mining, pages 1386–1395, 2014. URL https://doi.org/10.48550/arXiv.1403.3148.
Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding
the generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.
URL https://doi.org/10.48550/arXiv.math/0405343.
Risi Kondor and John D. Lafferty. Diffusion kernels on graphs and other discrete input
spaces. In Claude Sammut and Achim G. Hoffmann, editors, Machine Learning, Proceed-
ings of the Nineteenth International Conference (ICML 2002), University of New South
Wales, Sydney, Australia, July 8-12, 2002, pages 315–322. Morgan Kaufmann, 2002. URL
https://www.ml.cmu.edu/research/dap-papers/kondor-diffusion-kernels.pdf.
Leonid Kontorovich, Corinna Cortes, and Mehryar Mohri. Kernel methods for learning
languages. Theor. Comput. Sci., 405(3):223–236, 2008. doi: 10.1016/j.tcs.2008.06.037.
URL https://doi.org/10.1016/j.tcs.2008.06.037.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances
in neural information processing systems, 20, 2007. URL https://dl.acm.org/doi/10.
5555/2981562.2981710.
Isaac Reid, Krzysztof Choromanski, and Adrian Weller. Quasi-monte carlo graph random
features. arXiv preprint arXiv:2305.12470, 2023. URL https://doi.org/10.48550/
arXiv.2305.12470.
Alexander J Smola and Risi Kondor. Kernels and regularization on graphs. In Learning The-
ory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel
Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceed-
ings, pages 144–158. Springer, 2003. URL https://people.cs.uchicago.edu/~risi/
papers/SmolaKondor.pdf.
Alexander J. Smola and Bernhard Schölkopf. Bayesian kernel methods. In Shahar Mendel-
son and Alexander J. Smola, editors, Advanced Lectures on Machine Learning, Machine
Learning Summer School 2002, Canberra, Australia, February 11-22, 2002, Revised Lec-
tures, volume 2600 of Lecture Notes in Computer Science, pages 65–117. Springer, 2002.
URL https://doi.org/10.1007/3-540-36434-X_3.
Yasutoshi Yajima. One-class support vector machines for recommendation tasks. In Pacific-
Asia Conference on Knowledge Discovery and Data Mining, pages 230–239. Springer,
2006. URL https://dl.acm.org/doi/10.1007/11731139_28.
Felix Xinnan X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N
Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. Advances in neural in-
formation processing systems, 29, 2016. URL https://doi.org/10.48550/arXiv.1610.
09072.
Yufan Zhou, Changyou Chen, and Jinhui Xu. Learning manifold implicitly via explicit heat-
kernel learning. Advances in Neural Information Processing Systems, 33:477–487, 2020.
URL https://doi.org/10.48550/arXiv.2010.01761.
A Appendices
A.1 Minimum batch size scales logarithmically with number of walks
Consider an ensemble of m random walks S := {ωi}_{i=1}^m whose lengths are sampled from a geometric distribution with termination probability p. Trivially, Pr(len(ωi) < b) = 1 − (1 − p)^b. Given m independent walkers,

    Pr(max_{ωi∈S} len(ωi) < b) = Pr(∩_{i=1}^m {len(ωi) < b}) = (1 − (1 − p)^b)^m.    (17)
Take ‘with high probability’ to mean with probability at least 1 − δ, with δ ≪ 1 fixed. Then
    b = log(1 − (1 − δ)^{1/m}) / log(1 − p) ≃ (log(δ) − log(m)) / log(1 − p).    (18)
As the number of walkers m grows, the minimum value of b to ensure that all walkers are
shorter than b with high probability scales logarithmically with m.
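The claim can be verified by simulation: draw m geometric walk lengths per trial and check that all of them fall below the b of Eq. 18 in at least a fraction ≈ 1 − δ of trials. A sketch (parameter values arbitrary):

```python
import math
import random

def walk_length(p, rng):
    """Number of steps before termination: Pr(len = k) = (1 - p)^k p."""
    length = 0
    while rng.random() >= p:   # survive each step with probability 1 - p
        length += 1
    return length

def trial_all_short(m, p, b, rng):
    """One ensemble of m walks; True iff every walk is shorter than b."""
    return all(walk_length(p, rng) < b for _ in range(m))

# b from Eq. 18 for m walks and failure probability delta
m, p, delta = 50, 0.2, 0.1
b = math.log(1.0 - (1.0 - delta) ** (1.0 / m)) / math.log(1.0 - p)
```

Rounding b up to an integer only increases the success probability, so the empirical rate should sit at or slightly above 1 − δ.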
A.2 Proof of Theorem 2.1 (unbiasedness of g-GRFs)

We begin by proving that g-GRFs constructed according to Alg. 1 give unbiased estimation of graph functions with Taylor coefficients (αi)_{i=0}^∞, provided the discrete convolution relation in Eq. 4 is fulfilled.
Denote the set of m walks sampled out of node i by {Ω̄k^(i)}_{k=1}^m. Carefully considering Alg. 1, it is straightforward to convince oneself that the g-GRF estimator takes the form

    ϕ(i)v := (1/m) Σ_{k=1}^{m} Σ_{ωiv ∈ Ωiv} ( ω̃(ωiv) f(len(ωiv)) / p(ωiv) ) I(ωiv ∈ Ω̄k^(i)).    (19)

Here, Ωiv denotes the set of all graph walks between nodes i and v, of which each ωiv is a member. len(ωiv) is a function that gives the length of walk ωiv, and ω̃(ωiv) evaluates the product of the edge weights it traverses. p(ωiv) = (1 − p)^{len(ωiv)} Π_{i=1}^{len(ωiv)} (1/di) denotes the walk’s marginal probability. I is the indicator function, which evaluates to 1 if its argument is true (namely, walk ωiv is a prefix subwalk of Ω̄k^(i), the kth walk sampled out of i) and 0 otherwise. Trivially, we have that E[I(ωiv ∈ Ω̄k^(i))] = p(ωiv) (by construction, to make the estimator unbiased).³
Now note that:

    E[ϕ(i)⊤ ϕ(j)] = E[ Σ_{v∈N} ϕ(i)v ϕ(j)v ]
                  = Σ_v Σ_{ωiv ∈ Ωiv} Σ_{ωjv ∈ Ωjv} ω̃(ωiv) f(len(ωiv)) ω̃(ωjv) f(len(ωjv))
                  = Σ_v Σ_{l1=0}^{∞} Σ_{l2=0}^{∞} [W^{l1}]_{iv} [W^{l2}]_{jv} f(l1) f(l2)
                  = Σ_v Σ_{l1=0}^{∞} Σ_{l3=0}^{l1} [W^{l1−l3}]_{iv} [W^{l3}]_{jv} f(l1 − l3) f(l3)
                  = Σ_{l1=0}^{∞} [W^{l1}]_{ij} Σ_{l3=0}^{l1} f(l1 − l3) f(l3).    (20)
³ To flag a subtle point: in the sum we take ωiv ∈ Ωiv to mean that ωiv is a member of the set of all possible walks of any length between nodes i and v, Ωiv. On the other hand, inside the indicator function by ωiv ∈ Ω̄k^(i) we mean that ωiv is a prefix subwalk of one particular walk Ω̄k^(i) sampled out of node i. In these two cases the interpretation of the symbol ∈ should be slightly different.
From the first to the second line we used the definition of GRFs in Eq. 19. We then rewrote
the sum of the product of edge weights over all possible paths as powers of the weighted
adjacency matrix W, with l1,2 corresponding to walk lengths. To get to the fourth line, we
changed indices in the infinite sums, then the final line followed simply.
The final expression in Eq. 20 is exactly equal to Kα(W) := Σ_{l1=0}^{∞} αl1 W^{l1} provided we have that

    Σ_{l3=0}^{l1} f(l1 − l3) f(l3) := αl1,    (21)

as stated in Eq. 4 of the main text (with variables renamed to k and p).
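The conclusion of the proof can be checked numerically: with truncated power series, (Σ_l f(l) W^l)(Σ_l f(l) W^l)⊤ should match Σ_k αk W^k whenever f ∗ f = α. A sketch for the diffusion case (toy matrix ours):

```python
import numpy as np
from math import factorial

def series(W, coeffs):
    """Truncated matrix power series sum_k coeffs[k] W^k."""
    N = W.shape[0]
    out, Wk = np.zeros((N, N)), np.eye(N)
    for c in coeffs:
        out += c * Wk
        Wk = Wk @ W
    return out

# symmetric toy adjacency matrix with small spectral radius
W = 0.1 * np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
f = [1.0 / (2 ** l * factorial(l)) for l in range(30)]   # f * f gives 1/k!
alpha = [1.0 / factorial(k) for k in range(30)]
lhs = series(W, f) @ series(W, f).T                      # K_f(W) K_f(W)^T
rhs = series(W, alpha)                                   # K_alpha(W) = exp(W)
```

Here series(W, f) is a truncated exp(W/2), so the product reproduces exp(W) up to a negligible truncation error.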
A.3 Proof of Theorem 2.2 (computing symmetric modulation functions)

Here, we show how to compute f under the constraint that the modulation functions are identical, f1 = f2 = f. We will use the relationship in Eq. 4, reproduced below for the reader’s convenience:

    Σ_{p=0}^{i} f(i − p) f(p) = αi.    (22)
The iterative form in Eq. 6 is easy to show. For i = 0 we have that f(0)² = α0, so f(0) = ±√α0 = ±1 (where we used the normalisation condition α0 = 1). Now note that

    Σ_{p=0}^{i+1} f(i + 1 − p) f(p) = 2 f(0) f(i + 1) + Σ_{p=1}^{i} f(i + 1 − p) f(p) = αi+1,    (23)

so clearly

    f(i + 1) = ( αi+1 − Σ_{p=0}^{i−1} f(i − p) f(p + 1) ) / (2 f(0)).    (24)

This enables us to compute f(i + 1) given αi+1 and (f(p))_{p=0}^{i}.
The analytic form in Eq. 5 is only a little harder. Inserting the discrete convolution relation in Eq. 4 back into Eq. 2, we have that

    Kα(W) = Kf1(W) Kf2(W),    (25)

where Kf1(W) := Σ_{i=0}^∞ f1(i) W^i is the generating function corresponding to the sequence (f1(i))_{i=0}^∞. Constraining f1 = f2,

    Kf(W) = ± (Kα(W))^{1/2}.    (26)

We also discuss this in the ‘generating functions’ section of Sec. 2, where we use it to derive simple closed forms for f for some special kernels. Now we have that

    Σ_{i=0}^{∞} f(i) W^i = ± ( Σ_{n=0}^{∞} αn W^n )^{1/2}
                         = ± (1 + α1 W + α2 W² + ...)^{1/2}
                         = ± Σ_{n=0}^{∞} binom(1/2, n) (α1 W + α2 W² + ...)^n.    (27)

We need to equate powers of W between the generating functions. Consider the terms proportional to W^i. Clearly no such terms will feature when n > i, so we can restrict the sum on the RHS to 0 ≤ n ≤ i. Meanwhile, the term in (α1 W + α2 W² + ...)^n proportional to W^i is nothing other than

    Σ_{k1+2k2+3k3+...=i, k1+k2+k3+...=n} binom(n; k1, k2, k3, ...) α1^{k1} α2^{k2} α3^{k3} ...,    (28)
n
where k1 k2 k3 ... is the multinomial coefficient. Combining,
i 1
X X n
f (i) = ± 2 α1k1 α2k2 α3k3 ... (29)
n=0
n k1 +2k2 +3k3 ...=i
k1 k2 k3 ...
k1 +k2 +k3 +...=n
as shown in Eq. 5. Though this expression gives f purely in terms of α, the presence of the
conditional sum limits its practical utility compared to the iterative form.
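The square-root relation in Eqs. 25-26 is also easy to verify numerically. The following sketch (with an illustrative weight matrix, not from the paper's experiments) truncates both power series for the diffusion coefficients $\alpha_k = 1/k!$, for which $f(i) = (1/2)^i/i!$, and checks that $K_f(\mathbf{W})^2 \approx K_\alpha(\mathbf{W})$:

```python
import numpy as np
from math import factorial

def power_series(coeffs, W):
    """Truncated matrix power series sum_k coeffs[k] * W^k."""
    K, P = np.zeros_like(W), np.eye(W.shape[0])
    for c in coeffs:
        K += c * P
        P = P @ W
    return K

rng = np.random.default_rng(0)
A = rng.random((5, 5))
W = 0.1 * (A + A.T)  # small symmetric weight matrix with small spectral radius

T = 30  # truncation order; ample for this spectral radius
alpha = [1.0 / factorial(k) for k in range(T)]  # diffusion: K_alpha(W) = exp(W)
f = [0.5**k / factorial(k) for k in range(T)]   # its 'square root': K_f(W) = exp(W/2)

K_alpha = power_series(alpha, W)
K_f = power_series(f, W)
# K_f @ K_f reproduces K_alpha up to truncation error
```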
In this appendix we derive the bound on the empirical Rademacher complexity stated in
Theorem 2.3 and show the consequent generalisation bounds. The early stages closely follow
arguments made by Cortes et al. (2010). Recall that we have defined the hypothesis set
$$H = \{ x \to \mathbf{w}^\top \psi_{K_\alpha}(x) : |\alpha_i| \leq \alpha_i^{(M)}, \|\mathbf{w}\|_2 \leq 1 \}, \qquad (30)$$
with $\psi_{K_\alpha} : x \to \mathcal{H}_{K_\alpha}$ the feature mapping from the input space to the reproducing kernel
Hilbert space $\mathcal{H}_{K_\alpha}$ induced by the kernel $K_\alpha^{\mathcal{G}}$.
The empirical Rademacher complexity $\widehat{R}(H)$ for an arbitrary fixed sample $S = (x_i)_{i=1}^{m}$ is
defined by
$$\widehat{R}(H) := \mathbb{E}_\sigma \left[ \sup_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i) \right] \qquad (31)$$
where the expectation is taken over $\sigma = (\sigma_1, \ldots, \sigma_m)$, with $\sigma_i \sim \mathrm{Unif}(\pm 1)$ i.i.d. Rademacher
random variables.
Begin by noting that
$$h(x) := \mathbf{w}^\top \psi_{K_\alpha}(x) = \sum_{i=1}^{m} \beta_i K_\alpha(x_i, x) \qquad (32)$$
where $\beta_i$ are the coordinates of the orthogonal projection of $\mathbf{w}$ on $\mathcal{H}_S =
\mathrm{span}(\psi_{K_\alpha}(x_1), \ldots, \psi_{K_\alpha}(x_m))$, with $\boldsymbol{\beta}^\top \mathbf{K}_\alpha \boldsymbol{\beta} \leq 1$. Then we have that
$$\widehat{R}(H) = \frac{1}{m} \mathbb{E}_\sigma \left[ \sup_{\alpha, \beta} \boldsymbol{\sigma}^\top \mathbf{K}_\alpha \boldsymbol{\beta} \right]. \qquad (33)$$
The supremum $\sup_\beta \boldsymbol{\sigma}^\top \mathbf{K}_\alpha \boldsymbol{\beta}$ is reached when $\mathbf{K}_\alpha^{1/2} \boldsymbol{\beta}$ is collinear with $\mathbf{K}_\alpha^{1/2} \boldsymbol{\sigma}$, and making
$\|\boldsymbol{\beta}\|_2$ as large as possible gives
$$\widehat{R}(H) = \frac{1}{m} \mathbb{E}_\sigma \left[ \sup_\alpha \sqrt{\boldsymbol{\sigma}^\top \mathbf{K}_\alpha \boldsymbol{\sigma}} \right] \qquad (34)$$
as stated in Thm. 2.3. This bound is not tight for general graphs, but it is for specific
examples: for instance, when $\mathbf{W}$ is proportional to the identity, so that the only edges are
self-loops. Nonetheless, it provides some intuition for how the learned kernel's complexity
depends on $\mathbf{W}$.
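A Monte Carlo sketch of the quantity in Eq. 34 for a single fixed Gram matrix (so the supremum over $\alpha$ is trivial; the construction of the matrix and all sizes are illustrative, not from the paper). Since $\mathbb{E}[\boldsymbol{\sigma}^\top \mathbf{K} \boldsymbol{\sigma}] = \mathrm{tr}(\mathbf{K})$ for Rademacher $\boldsymbol{\sigma}$, Jensen's inequality supplies the sanity check $\widehat{R}(H) \leq \sqrt{\mathrm{tr}(\mathbf{K})}/m$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20
B = rng.standard_normal((m, m))
K = B @ B.T  # an arbitrary positive semidefinite Gram matrix

n_samples = 5000
sigma = rng.choice([-1.0, 1.0], size=(n_samples, m))  # Rademacher vectors
proj = sigma @ B                    # K = B B^T, so sigma^T K sigma = ||B^T sigma||^2
quad = (proj ** 2).sum(axis=1)      # one quadratic form per sample, exactly >= 0
rademacher_est = np.sqrt(quad).mean() / m  # Monte Carlo estimate of Eq. 34
```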
As stated in the main text, Eq. 37 immediately yields generalisation bounds for learning
kernels. Again closely following Cortes et al. (2010), consider the application of binary
classification where nodes are assigned a label $y_i = \pm 1$. Denote by $R(h)$ the true
generalisation error of $h \in H$,
$$R(h) = \Pr(y h(x) < 0). \qquad (38)$$
Consider a training sample $S = ((x_i, y_i))_{i=1}^{m}$, and define the $\rho$-empirical margin loss by
$$\widehat{R}_\rho(h) := \frac{1}{m} \sum_{i=1}^{m} \min\left(1, \left[1 - y_i h(x_i)/\rho\right]_+\right) \qquad (39)$$
where $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following bound holds for
any $h \in H$ (Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002):
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \widehat{R}(H) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}. \qquad (40)$$
Inserting the bound on the empirical Rademacher complexity in Eq. 37, we immediately
have that
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \sqrt{\frac{1}{m} \sum_{i=0}^{\infty} \alpha_i^{(M)} \rho(\mathbf{W})^i} + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}} \qquad (41)$$
which shows how the generalisation error can be controlled via $\alpha^{(M)}$ or $\rho(\mathbf{W})$.
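As an illustrative sketch (hypothetical sizes, margin and coefficients, not the paper's experiments), the middle term of Eq. 41 can be evaluated directly. For diffusion-style coefficients $\alpha_k^{(M)} = 1/k!$ the series is simply $\exp(\rho(\mathbf{W}))$, which the test below exploits:

```python
import numpy as np
from math import factorial, sqrt

rng = np.random.default_rng(1)
A = rng.random((50, 50))
W = 0.05 * (A + A.T)                              # regularised symmetric weights
spec_rad = np.max(np.abs(np.linalg.eigvalsh(W)))  # rho(W), the spectral radius

m = 50     # number of training nodes (illustrative)
margin = 1.0  # the margin parameter rho of Eq. 41 (illustrative)
T = 40     # series truncation; ample for this spectral radius
alpha_max = [1.0 / factorial(k) for k in range(T)]  # alpha_i^(M), diffusion-style
series = sum(a * spec_rad**k for k, a in enumerate(alpha_max))  # ~ exp(rho(W))

complexity_term = (2.0 / margin) * sqrt(series / m)  # middle term of Eq. 41
```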
Recalling from Eq. 1 that the modulation functions $(f_1, f_2)$ do not necessarily need to be
equal for unbiased estimation of some kernel $K_\alpha$, a natural extension of Sec. 3.4 is to
introduce two separate neural modulation functions $f_{1,2}^{(N)}$ and train them both following
the scheme in Sec. 3.4. Intriguingly, even with an initialisation where $f_2^{(N)}$ encodes 'lazy'
behaviour (deposits almost all its load at the starting node – see Sec. 2) and $f_1^{(N)}$ is flat,
upon training the neural modulation functions quickly become very similar (though not
identical). See Fig. 6. We use the same optimisation hyperparameters and network
architectures as in Sec. 3.4. Rigorously proving the best possible choice of $(f_1, f_2)$,
including whether e.g. a symmetric pair is optimal, is left as an exciting open theoretical
problem. We note that, whilst parameterising two separate neural modulation functions gives
a more general mechanism (with the symmetric pair as a special case), it also doubles the
number of parameters required, slowing training and evaluation.

Figure 6: Modulation functions $(f_1, f_2)$ parameterised by separate neural networks before
and after training to target the 2-regularised Laplacian kernel. (Plot of $f(i)$ against $i$,
showing the initial and trained $f_1$ and $f_2$.)
Here we further investigate the behaviour of the g-GRFs kernel approximation as the
properties of the graph change. To do this, we generate random graphs using the Erdős-Rényi
model, where each of the $\binom{N}{2}$ possible edges is present independently with probability
$p_\mathrm{edge}$ or absent with probability $1 - p_\mathrm{edge}$.
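A minimal sketch of this graph generator (function and variable names are ours):

```python
import numpy as np

def erdos_renyi_adjacency(n, p_edge, seed=0):
    """Symmetric 0/1 adjacency matrix: each of the n-choose-2 possible
    edges is present independently with probability p_edge."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random((n, n)) < p_edge, k=1)  # sample each pair once
    return (upper + upper.T).astype(float)             # mirror into both triangles

A = erdos_renyi_adjacency(100, 0.5)
```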
Firstly, we investigate how the approximation error varies with graph sparsity. We take
a fixed number of nodes ($N = 100$) and control the sparsity by varying $p_\mathrm{edge}$ between
0.1 and 0.9. We then approximate the diffusion kernel using g-GRFs with a termination
probability $p_\mathrm{halt} = 0.1$ and 8 walkers. For each graph we compute the relative Frobenius
norm error between the true ($\mathbf{K}$) and approximated ($\widehat{\mathbf{K}}$) kernels: namely, $\|\mathbf{K} - \widehat{\mathbf{K}}\|_F / \|\mathbf{K}\|_F$.
We repeat 100 times to obtain the standard deviation of the mean error estimate. Fig. 7
shows the results; approximation quality degrades slightly as pedge grows, but remains very
good throughout (error < 0.00275).
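The error metric is easy to reproduce. In this sketch we use a truncated power series as a cheap stand-in for the g-GRF estimator (which would require the full random-walk machinery of Sec. 3), and compute $\exp(\mathbf{W})$ exactly by eigendecomposition; all parameter choices are illustrative:

```python
import numpy as np
from math import factorial

def relative_frobenius_error(K_true, K_approx):
    """||K - K_hat||_F / ||K||_F, the metric reported in this section."""
    return np.linalg.norm(K_true - K_approx) / np.linalg.norm(K_true)

def diffusion_kernel(M):
    """exp(M) for symmetric M, via eigendecomposition."""
    lam, V = np.linalg.eigh(M)
    return (V * np.exp(lam)) @ V.T

def truncated_diffusion(M, T):
    """Stand-in approximation: first T terms of exp(M) = sum_k M^k / k!."""
    K, P = np.zeros_like(M), np.eye(M.shape[0])
    for k in range(T):
        K += P / factorial(k)
        P = P @ M
    return K

rng = np.random.default_rng(0)
upper = np.triu(rng.random((100, 100)) < 0.5, k=1)
W = 0.02 * (upper + upper.T).astype(float)  # regularised ER graph, p_edge = 0.5
err = relative_frobenius_error(diffusion_kernel(W), truncated_diffusion(W, 10))
```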
Next, we investigate the scalability of the method by comparing the time for exact and
approximate evaluation of the diffusion kernel as the size of the graph grows. We fix $p_\mathrm{edge} =
0.5$ and vary the number of nodes $N$ between 100 and 12800. For every graph, we measure
the wall-clock time for i) exact computation of $\mathbf{K}$ by computing the matrix exponential and
ii) approximate computation of $\widehat{\mathbf{K}}$ using g-GRFs (8 walkers, $p_\mathrm{halt} = 0.5$, regulariser $\sigma = 0.5$).
Fig. 8 shows the results. Naturally, the exact method scales worse, becoming slower for
graphs bigger than a few thousand nodes, and by the largest graph g-GRFs are already
faster by a factor of 7.8. We also measure the relative Frobenius norm error between the
true and approximated Gram matrices to check that the quality of estimation remains good.
This is shown by the green line, which indeed remains almost constant at a very small
value ($\simeq 0.005$).
Figure 7: Approximation error vs. edge-generation probability for the diffusion kernel on a
graph of size N = 100. The quality remains good as sparsity changes, varying within a
narrow range.

Figure 8: Wall-clock time for exact and approximate kernel evaluation as the number of
nodes varies. g-GRFs scale better and are faster with a few thousand nodes. The
approximation error remains low as N grows.
In this short section, we provide further experimental details and discussion to supplement
the main text.
1. Choice of $p_\mathrm{halt}$: The termination probability encodes a simple trade-off between
   approximation quality and speed; if $p_\mathrm{halt}$ is lower, walks tend to last longer and
   sample more subwalks, so they give a better approximation of graph kernels. In
   practice, any reasonably small value works well, as also reported for the original GRFs
   mechanism (Choromanski, 2023). In experiments we typically choose $p_\mathrm{halt} \sim 0.1$.
2. Choice of $\sigma$ and kernel convergence: After Eq. 2 we noted that, when
   approximating some fixed kernel $K_\alpha(\mathbf{W}) = \sum_{k=0}^{\infty} \alpha_k \mathbf{W}^k$, we assume that the sum
   does not diverge. It would not be possible to construct a finite random feature
   estimator if this were not the case. We require that $\sum_{k=0}^{\infty} \alpha_k \rho(\mathbf{W})^k$ is finite, with
   $\rho(\mathbf{W}) := \max_{\lambda \in \Lambda(\mathbf{W})}(|\lambda|)$ the spectral radius of the weighted adjacency matrix
   ($\Lambda(\mathbf{W})$ is the set of eigenvalues of $\mathbf{W}$). When considering e.g. the diffusion kernel
   in the literature, this is ensured by multiplying $\mathbf{W}$ by some regulariser $\sigma \in \mathbb{R}_+$,
   taking e.g. $\mathbf{W} \to \sigma \mathbf{W}$ whereupon $\lambda_i \to \sigma \lambda_i \; \forall \; i = 1, \ldots, N$. This is the reason for the
   extra parameters $\{\sigma, \alpha\}$ in the kernel expressions in Table 1. As we noted in Sec. 3.5,
   this is exactly equivalent to transforming the modulation function $f(i) \to f(i) \sigma^i \; \forall \; i$.
   Where space allows, we report the exact choice of $\sigma$ with each experiment (though
   empirically it does not modify our conclusions). In Sec. 3.5 we take the weighted
   adjacency matrix $\mathbf{W} := [a_{ij} / \sqrt{d_i d_j}]_{i,j=1}^{N}$ with $a_{ij} = 1$ if node $i$ is connected to
   node $j$ and $0$ otherwise, and $d_i$ the degree of node $i$. We then regularise by taking
   $\mathbf{W} \to \sigma \mathbf{W}$ with $\sigma = 0.025$, which is small enough for convergence even with the
   largest graph considered. We use $m = 16$ walkers and $p_\mathrm{halt} = 0.5$.
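The claimed equivalence between $\mathbf{W} \to \sigma\mathbf{W}$ and $f(i) \to f(i)\sigma^i$ follows immediately from $\sum_i f(i)(\sigma\mathbf{W})^i = \sum_i (f(i)\sigma^i)\mathbf{W}^i$, and can be checked numerically; the weight matrix and truncation below are illustrative:

```python
import numpy as np
from math import factorial

def series(fvals, W):
    """Truncated power series sum_i fvals[i] * W^i."""
    K, P = np.zeros_like(W), np.eye(W.shape[0])
    for c in fvals:
        K += c * P
        P = P @ W
    return K

rng = np.random.default_rng(0)
A = rng.random((6, 6))
W = A + A.T                                     # unregularised symmetric weights
sigma = 0.025
f = [0.5**i / factorial(i) for i in range(15)]  # diffusion modulation function

lhs = series(f, sigma * W)                                  # regularise the weights...
rhs = series([fi * sigma**i for i, fi in enumerate(f)], W)  # ...or modulate f instead
```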