General Graph Random Features
Published as a conference paper at ICLR 2024
We assume that the sum above converges for all W under consideration, which can be ensured with a regulariser W → σW, σ ∈ R+. Without loss of generality, we also assume that α is normalised such that α0 = 1. The matrix Kα(W) can be associated with a graph function Kα^G : N × N → R mapping from a pair of graph nodes to a real number. Note that if G is an undirected graph then Kα(W) automatically inherits the symmetry of W. In this case, it follows from Weyl’s perturbation inequality (Bai et al., 2000) that Kα(W) is positive semidefinite for any given α provided the spectral radius ρ(W) := max_{λ∈Λ(W)} |λ| is sufficiently small (with Λ(W) the set of eigenvalues of W). This can again be ensured by multiplying the weight matrix W by a regulariser σ ∈ R+. It then follows that Kα(W) can be considered the Gram matrix of a graph kernel function Kα^G.
With suitably chosen α = (αk)_{k=0}^∞, the class described by Eq. 2 includes many popular examples of graph node kernels in the literature (Smola and Kondor, 2003; Chapelle et al., 2002). They measure connectivity between nodes and are typically functions of the graph Laplacian matrix, defined by L := I − W̃ with W̃ := [wij / √(di dj)]_{i,j=1}^N. Here, di := Σ_j wij is the weighted degree of node i, such that W̃ is the normalised weighted adjacency matrix. For reference, Table 1 gives the kernel definitions and normalised coefficients αk (corresponding to powers of W̃) to be considered later in the manuscript. In practice, factors in αk equal to a quantity raised to the power of k are absorbed into the normalisation of W̃.
Name                      Form                    αk
d-regularised Laplacian   (IN + σ²L)^{−d}         binom(d+k−1, k) (1 + σ^{−2})^{−k}
p-step random walk        (αIN − L)^p, α ≥ 2      binom(p, k) (α − 1)^{−k}
Diffusion                 exp(−σ²L/2)             (1/k!) (σ²/2)^k
Inverse Cosine            cos(Lπ/4)               (1/k!) (π/4)^k · { (−1)^{k/2} if k even, (−1)^{(k−1)/2} if k odd }
Table 1: Different graph functions/kernels Kα^G : N × N → R. The exp and cos mappings are defined via Taylor series expansions rather than element-wise, e.g. exp(M) := lim_{n→∞} (IN + M/n)^n and cos(M) := Re(exp(iM)). σ and α are regularisers. Note that the diffusion kernel is sometimes instead defined by exp(σ²(IN − L)), but these forms are equivalent up to normalisation. Also note that the p-step random walk kernel is closely related to the graph Matérn kernel (Borovitskiy et al., 2021).
The chief goal of this work is to construct a random feature map ϕ(i) : N → R^l, with l ∈ N, that provides unbiased approximation of Kα(W) as in Eq. 1. To do so, we consider the following algorithm.
Algorithm 1 Constructing a random feature vector ϕf(i) ∈ R^N to approximate Kα(W)
Input: weighted adjacency matrix W ∈ R^{N×N}, vector of unweighted node degrees (no. neighbours) d ∈ R^N, modulation function f : (N ∪ {0}) → R, termination probability phalt ∈ (0, 1), node i ∈ N, number of random walks to sample m ∈ N.
Output: random feature vector ϕf(i) ∈ R^N
1: initialise: ϕf(i) ← 0
2: for w = 1, ..., m do
3:   initialise: load ← 1
4:   initialise: current_node ← i
5:   initialise: terminated ← False
6:   initialise: walk_length ← 0
7:   while terminated = False do
8:     ϕf(i)[current_node] ← ϕf(i)[current_node] + load × f(walk_length)
9:     walk_length ← walk_length + 1
10:    new_node ← Unif[N(current_node)]                                  ▷ assign to one of the neighbours
11:    load ← load × d[current_node] / (1 − phalt) × W[current_node, new_node]   ▷ update load
12:    current_node ← new_node
13:    terminated ← (t ∼ Unif(0, 1) < phalt)                             ▷ draw RV t to decide on termination
14:  end while
15: end for
16: normalise: ϕf(i) ← ϕf(i)/m
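For concreteness, Alg. 1 can be sketched in a few lines of NumPy. This is a minimal reference implementation under our own naming conventions, not the authors' code; for simplicity we recompute the unweighted degrees d from W rather than passing them in.

```python
import numpy as np

def g_grf_feature(W, f, p_halt, i, m, rng):
    """Sample a g-GRF feature vector phi_f(i) following Alg. 1.

    W: (N, N) weighted adjacency matrix; f: modulation function on walk
    lengths; p_halt: termination probability; i: start node; m: number
    of walks; rng: np.random.Generator.
    """
    N = W.shape[0]
    neighbours = [np.flatnonzero(W[v]) for v in range(N)]  # unweighted adjacency
    d = np.array([len(nb) for nb in neighbours])           # unweighted degrees
    phi = np.zeros(N)
    for _ in range(m):
        load, cur, length = 1.0, i, 0
        while True:
            phi[cur] += load * f(length)                   # deposit modulated load
            length += 1
            new = rng.choice(neighbours[cur])              # uniform neighbour
            load *= d[cur] / (1.0 - p_halt) * W[cur, new]  # importance weight
            cur = new
            if rng.random() < p_halt:                      # geometric termination
                break
    return phi / m
```

With the ‘lazy’ modulation function f(k) = I(k = 0), every walk deposits its unit load at the start node, so ϕf(i) is exactly the one-hot vector e_i; pairing a lazy walker with f1(k) = αk then recovers [Kα]ij in expectation, as described below Theorem 2.1.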
Figure 1: Schematic for a random walk on a graph (solid red) and an accompanying modulation function f (dashed blue) used to approximate an arbitrary graph node function K^G.
This is identical to the algorithm presented by Choromanski (2023) for constructing features to approximate the 2-regularised Laplacian kernel, apart from the presence of the extra modulation function f : (N ∪ {0}) → R in line 8, which upweights or downweights contributions from walks depending on their length (see Fig. 1). We refer to ϕf as general graph random features (g-GRFs), where the subscript f identifies the modulation function. Crucially, the time complexity of Alg. 1 is subquadratic in the number of nodes N, in contrast to exact methods which are O(N³).
We now state the following central result, proved in App. A.2.
Theorem 2.1 (Unbiased approximation of Kα via convolutions). For two modulation functions f1, f2 : (N ∪ {0}) → R, g-GRFs (ϕf1(i))_{i=1}^N, (ϕf2(i))_{i=1}^N constructed according to Alg. 1 give unbiased approximation of Kα,

    [Kα]ij = E[ϕf1(i)⊤ ϕf2(j)],    (3)

for kernels with an arbitrary Taylor expansion α = (αk)_{k=0}^∞, provided that α = f1 ∗ f2. Here, ∗ is the discrete convolution of the modulation functions f1, f2; that is, for all k ∈ (N ∪ {0}),

    Σ_{p=0}^{k} f1(k − p) f2(p) = αk.    (4)
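The constraint in Eq. 4 can be inverted numerically: given any f2 with f2(0) ≠ 0, forward substitution solves for f1. A sketch (function names are ours):

```python
from math import factorial

def deconvolve(alpha, f2):
    """Solve sum_{p<=k} f1(k-p) f2(p) = alpha_k for f1, given f2(0) != 0."""
    assert f2[0] != 0
    f1 = []
    for k in range(len(alpha)):
        # residual collects the terms of the convolution already fixed
        residual = sum(f1[k - p] * f2[p] for p in range(1, k + 1))
        f1.append((alpha[k] - residual) / f2[0])
    return f1

def convolve(f1, f2):
    """Discrete convolution, truncated to len(f1) coefficients."""
    return [sum(f1[k - p] * f2[p] for p in range(k + 1)) for k in range(len(f1))]
```

For the ‘lazy’ choice f2 = (1, 0, 0, ...) this returns f1(k) = αk, recovering the trivial solution discussed in the text.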
Clearly the class of pairs of modulation functions f1 , f2 that satisfy Eq. 4 is highly degen-
erate. Indeed, it is possible to solve for f1 given any f2 and α provided f2 (0) ̸= 0. For
instance, a trivial solution is given by: f1 (i) = αi , f2 (i) = I(i = 0) with I(·) the indicator
function. In this case, the walkers corresponding to f2 are ‘lazy’, depositing all their load
at the node at which they begin. Contributions to the estimator ϕf1 (i)⊤ ϕf2 (j) only come
from walkers initialised at i traversing all the way to j rather than two walkers both passing
through an intermediate node. Also of great interest is the case of symmetric modulation
functions f1 = f2 , where now intersections do contribute. In this case, the following is true
(proved in App. A.3).
Theorem 2.2 (Computing symmetric modulation functions). Supposing f1 = f2 = f, Eq. 4 is solved by a function f which is unique (up to a sign) and is given by

    f(i) = ± Σ_{n=0}^{i} binom(1/2, n) Σ_{k1+2k2+3k3+...=i, k1+k2+k3+...=n} binom(n; k1, k2, k3, ...) α1^{k1} α2^{k2} α3^{k3} ....    (5)

In practice f can instead be computed with the equivalent iterative scheme, f(0) = ±√α0 = ±1 and

    f(i + 1) = (αi+1 − Σ_{p=0}^{i−1} f(i − p) f(p + 1)) / (2 f(0)),    (6)

which gives f(i + 1) from αi+1 and (f(p))_{p=0}^{i} (see App. A.3).
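The iterative scheme is a few lines of code; a sketch taking the positive root and our normalisation α0 = 1:

```python
from math import factorial

def symmetric_modulation(alpha):
    """Compute f with f * f = alpha via the Eq. 6 recursion (positive root)."""
    f = [alpha[0] ** 0.5]  # f(0) = +sqrt(alpha_0) = 1 under the normalisation
    for i in range(len(alpha) - 1):
        residual = sum(f[i - p] * f[p + 1] for p in range(i))
        f.append((alpha[i + 1] - residual) / (2.0 * f[0]))
    return f
```

For the diffusion kernel with the σ-dependence absorbed into W̃, αk = 1/k!, and the recursion reproduces the closed form f(i) = 1/(2^i i!), consistent with √exp(W) = exp(W/2).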
For symmetric modulation functions, the random features ϕf1(i) and ϕf2(i) are identical apart from the particular sample of random walks used to construct them. They must not share the same sample, or estimates of the diagonal kernel elements [Kα]ii will be biased.
Computational cost: Note that when running Alg. 1 one only needs to evaluate the
modulation functions f1,2 (i) up to the length of the longest walk one samples. A batch
of size b, (f1,2(i))_{i=1}^b, can be pre-computed in time O(b²) and reused for random features corresponding to different nodes and even different graphs. Further values of f can be computed at runtime if b is too small, and also reused for later computations. Moreover, the minimum length b required to ensure that all m walks are shorter than b with high probability (Pr(∩_{i=1}^m {len(ωi) ≤ b}) > 1 − δ, δ ≪ 1) scales only logarithmically with m (see App. A.1). This
means that, despite estimating a much more general family of graph functions, g-GRFs are
essentially no more expensive than the original GRF algorithm. Moreover, any techniques
used for dimensionality reduction of regular GRFs (e.g. applying the Johnson-Lindenstrauss
transform (Dasgupta et al., 2010) or using ‘anchor points’ (Choromanski, 2023)) can also
be used with g-GRFs, providing further efficiency gains.
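The logarithmic scaling of b can be checked directly from the closed form derived in App. A.1, b = log(1 − (1 − δ)^{1/m}) / log(1 − p_halt). A numeric sketch (parameter values arbitrary):

```python
from math import log

def min_batch_size(m, p_halt, delta):
    """Smallest b such that all m geometric-length walks are shorter
    than b with probability at least 1 - delta (App. A.1, Eq. 18)."""
    return log(1.0 - (1.0 - delta) ** (1.0 / m)) / log(1.0 - p_halt)
```

Multiplying m by 100 adds roughly log(100)/(−log(1 − p_halt)) to b, independent of m, which is the logarithmic scaling claimed above.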
Generating functions: Inserting the constraint for unbiasedness in Eq. 4 back into the definition of Kα(W), we immediately have that

    Kα(W) = Kf1(W) Kf2(W),    (7)

where Kf1(W) := Σ_{i=0}^∞ f1(i) W^i is the generating function corresponding to the sequence (f1(i))_{i=0}^∞. This is natural because the (discrete) Fourier transform of a (discrete) convolution returns the product of the (discrete) Fourier transforms of the respective functions. In the symmetric case f1 = f2, it follows that

    Kf(W) = ± (Kα(W))^{1/2}.    (8)
If the RHS has a simple Taylor expansion (e.g. Kα(W) = exp(W) so that Kf(W) = exp(W/2)), this enables us to obtain f without recourse to the conditional sum in Eq. 5 or the iterative expression in Eq. 6. This is the case for many popular graph kernels; we provide some prominent examples in the table below. A notable exception is the inverse cosine kernel.

Name                      f(i)
d-regularised Laplacian   (d − 2 + 2i)!! / ((2i)!! (d − 2)!!)
p-step random walk        binom(p/2, i)
Diffusion                 1 / (2^i i!)

As an interesting corollary, by considering the diffusion kernel we have also proved that Σ_{p=0}^{k} (1/(2^p p!)) (1/(2^{k−p} (k−p)!)) = 1/k!.
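These closed forms can be sanity-checked numerically against the αk of Table 1 (with the k-th-power factors absorbed into W̃, as described in Sec. 2). A sketch:

```python
from math import factorial

def gen_binom(x, n):
    """Generalised binomial coefficient binom(x, n) for real x."""
    out = 1.0
    for t in range(n):
        out *= (x - t) / (t + 1)
    return out

def self_convolve(f, K):
    """First K coefficients of the discrete self-convolution f * f."""
    return [sum(f(k - p) * f(p) for p in range(k + 1)) for k in range(K)]

# Diffusion: f(i) = 1/(2^i i!) should convolve to alpha_k = 1/k!
diff_conv = self_convolve(lambda i: 1.0 / (2 ** i * factorial(i)), 8)

# p-step random walk: f(i) = binom(p/2, i) should convolve to binom(p, k)
p = 3
rw_conv = self_convolve(lambda i: gen_binom(p / 2, i), 8)
```

The random-walk case is just the Vandermonde identity Σ_j binom(p/2, k − j) binom(p/2, j) = binom(p, k), which holds for non-integer upper arguments too.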
Theorem 2.3 (Empirical Rademacher complexity bound). For a fixed sample S = (xi)_{i=1}^m, the empirical Rademacher complexity R̂(H) is bounded by

    R̂(H) ≤ √( (1/m) Σ_{i=0}^∞ αi^(M) ρ(W)^i ),    (10)

where ρ(W) is the spectral radius of the weighted adjacency matrix W.
3 Experiments
Here we test the empirical performance of g-GRFs, both for approximating fixed kernels
(Secs 3.1-3.3) and with learnable neural modulation functions (Secs 3.4-3.5).
We begin by confirming that g-GRFs do indeed give unbiased estimates of the graph kernels listed in Table 1, taking regularisers σ = 0.25 and α = 20 with phalt = 0.1. We use symmetric modulation functions f, computed with the closed forms where available and with the iterative scheme in Eq. 6 otherwise. Fig. 2 plots the relative Frobenius norm error between the true kernels K and their g-GRF approximations K̂ (that is, ∥K − K̂∥F / ∥K∥F) against the number of random walkers m. We consider 8 different graphs: a small random Erdős-Rényi graph, a larger Erdős-Rényi graph, a binary tree, a d-regular graph and 4 real-world examples (karate, dolphins, football and eurosis) (Ivashkin, 2023). They vary substantially in size. For every graph and for all kernels, the quality of the estimate improves as m grows, with the error becoming very small even for a modest number of walkers.
[Figure 2 panels: ER (N = 20, p = 0.2), ER (N = 100, p = 0.04), Binary tree (N = 127), d-regular (N = 100, d = 10), Karate (N = 34), Dolphins (N = 62), Football (N = 115), Eurosis (N = 1272). Each panel plots Frobenius norm error against the number of random walks for the 1-reg Laplacian, 2-reg Laplacian, 2-step RW, diffusion and cosine kernels.]
Figure 2: Unbiased estimation of popular kernels on different graphs using g-GRFs. The
approximation error (y-axis) improves with the number of walkers (x-axis). We repeat 10
times; one standard deviation of the mean error is shaded.
…under the ODE for t = 1 with n = 10 discretisation timesteps and P uniform, approximating exp(W(t − τj)) with different numbers of walkers m. As m grows, the quality of approximation improves and the (normalised) error on the final state, ∥x̂(1) − x(1)∥2 / ∥x(1)∥2, drops for every graph. We take 100 repeats for statistics and plot the results in Fig. 3. One standard deviation of the mean error is shaded. phalt = 0.1 and the regulariser is given by σ = 1.

Figure 3: ODE simulation error decreases as the number of walkers grows.
Following the discussion in Sec. 2.1, we now replace fixed f with a neural modulation function f(N) parameterised by a simple neural network with 1 hidden layer of size 1:

    f(N)(x) = σsoftplus(w2 σReLU(w1 x + b1) + b2),    (16)
where w1 , b1 , w2 , b2 ∈ R and σReLU and σsoftplus are the ReLU and softplus activation func-
tions, respectively. Bigger, more expressive architectures (including allowing f (N ) to take
negative values) can be used but this is found to be sufficient for our purposes.
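Eq. 16 is small enough to write out directly; a sketch with arbitrary placeholder weights (the trained values are not reproduced here):

```python
from math import exp, log1p

def neural_modulation(w1, b1, w2, b2):
    """f(N)(x) = softplus(w2 * relu(w1 * x + b1) + b2), as in Eq. 16."""
    def f(x):
        hidden = max(w1 * x + b1, 0.0)        # ReLU hidden layer of size 1
        return log1p(exp(w2 * hidden + b2))   # softplus output keeps f > 0
    return f
```

Because of the softplus output, f(N)(x) > 0 everywhere, so this architecture cannot represent sign-alternating modulation functions (e.g. that of the inverse cosine kernel) unless it is extended as noted above.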
We define our loss function to be the Frobenius norm error between a target Gram matrix
and our g-GRF-approximated Gram matrix on the small Erdős-Rényi graph (N = 20) with
m = 16 walks. For the target, we choose the 2-regularised Laplacian kernel. We train
symmetric f1(N) = f2(N), but provide a brief discussion of the asymmetric case (including its
respective strengths and weaknesses) in App. A.5. On this graph, we minimise the loss with
the Adam optimiser and a decaying learning rate (LR = 0.01, γ = 0.975, 1000 epochs). We
make the following striking observation: f (N ) does not generically converge to the unique
unbiased (symmetric) modulation function implied by α, but instead to some different f
that though biased gives a smaller mean squared error (MSE). This is possible because by
downweighting long walks the learned f (N ) gives estimators with a smaller variance, which
is sufficient to suppress the MSE on the kernel matrix elements even though it no longer
gives the target value in expectation. We then fix f (N ) and use it for kernel approximation
on the remaining graphs. The learned, biased f (N ) still provides better kernel estimates,
including for graphs with very different topologies and a much greater number of nodes:
eurosis is bigger by a factor of over 60. See Table 3 for the results. phalt = 0.5 and σ = 0.8.
Naturally, the learned f(N) depends on the number of random walks m; as m grows,
the variance on the kernel approximation drops so it is intuitive that the learned f (N ) will
approach the unbiased f . Fig. 4 empirically confirms this is the case, showing the learned
f (N ) for different numbers of walkers. The line labelled ∞ is the unbiased modulation
function, which for the 2-regularised Laplacian kernel is constant.
Table 3 (excerpt): kernel approximation error with the unbiased modulation function f and the learned, biased f(N). Brackets give one standard deviation on the final digit.

Graph       N      Unbiased f   Learned f(N)
d-regular   100    0.0490(2)    0.0434(2)
karate      34     0.0492(6)    0.0439(6)
dolphins    62     0.0505(5)    0.0449(5)
football    115    0.0520(2)    0.0459(2)
eurosis     1272   0.0551(2)    0.0484(2)

[Figure 4: learned modulation function f(N)(i) against i for m ∈ {64, 128, 256, 512} random walks, alongside the line labelled ∞, the unbiased modulation function.]
These learned modulation functions might guide the more principled construction of biased,
low-MSE GRFs in the future. An analogue in Euclidean space is provided by structured orthogonal random features (SORFs), which replace a random Gaussian matrix with an HD-product to estimate the Gaussian kernel (Choromanski et al., 2017; Yu et al., 2016). This likewise improves kernel approximation quality despite breaking estimator unbiasedness.
As suggested in Sec. 2.1, it is also possible to train the neural modulation function f (N )
directly using performance on a downstream task, performing implicit kernel learning. We
have argued that this is much more scalable than optimising Kα directly.
In this spirit, we now address the problem of kernel regression on triangular mesh graphs, as previously considered by Reid et al. (2023). For graphs in this dataset (Dawson-Haggerty, 2023), every node is associated with a normal vector v(i) ∈ R³ equal to the mean of the normal vectors of its 3 surrounding faces. The task is to predict the directions of missing vectors (a random 5% split) from the remainder. Our (unnormalised) predictions are given by v̂(i) := Σ_j K̂(N)(i, j) v(j), where K̂(N) is a kernel estimate constructed using g-GRFs with a neural modulation function f(N) (see Eq. 3). The angular prediction error is 1 − cos θi, with θi the angle between the true normal v(i) and the approximate normal v̂(i), averaged over the missing vectors. We train a symmetric pair f(N) using this angular prediction error on the small cylinder graph (N = 210) as the loss function. Then we freeze f(N) and compute the angular prediction error for other, larger graphs. Fig. 5 shows the learned f(N) as well as some other modulation functions corresponding to popular fixed kernels. Note also that learning f(N) already includes (but is not limited to) optimising the lengthscale of a given kernel: taking W̃ → βW̃ is identical to f(i) → f(i)β^i ∀ i.

Figure 5: Fixed and learned modulation functions for the kernel regression task (f(i) against i for the learned f(N) and the 1-reg Laplacian, 2-reg Laplacian and diffusion kernels).

The prediction errors are highly correlated between the different modulation functions for a
given random draw of walks; ensembles which ‘explore’ the graph poorly, terminating quickly
or repeating edges, will give worse g-GRF estimators for every f . For this reason, we com-
pute the prediction errors as the average normalised difference compared to the learned
kernel result. Table 4 reports the results. Crucially, this difference is found to be positive
for every graph and every fixed kernel, meaning the learned kernel always performs best.
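The prediction rule is a kernel-weighted average of the known normals, scored by the angular error. A small sketch with a synthetic kernel estimate (all data here is made up for illustration; in the paper K̂(N) comes from g-GRFs):

```python
import numpy as np

def predict_normals(K_hat, V, missing):
    """Unnormalised predictions v_hat(i) = sum_j K_hat[i, j] v(j),
    summing only over nodes whose normals are known."""
    known = [j for j in range(V.shape[0]) if j not in set(missing)]
    return {i: K_hat[i, known] @ V[known] for i in missing}

def angular_error(v_true, v_hat):
    """1 - cos(theta) between the true and predicted normals."""
    cos = v_true @ v_hat / (np.linalg.norm(v_true) * np.linalg.norm(v_hat))
    return 1.0 - cos
```

Note that only the direction of v̂(i) matters for the loss, which is why the predictions can be left unnormalised.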
Table 4: Normalised difference in angular prediction error compared to the learned
kernel defined by f (N ) (trained on the cylinder graph). All entries are positive since
the learned kernel always performs best. We take 100 repeats (but only 10 for the very
large cycloidal graph). The brackets give one standard deviation on the final digit.
It is remarkable that the learned f (N ) gives the smallest error for all the graphs even
though it was just trained on cylinder, the smallest one. We have implicitly learned a
good kernel for this task which generalises well across topologies. It is also notable that the diffusion kernel performs only slightly worse. This is to be expected because its modulation function is similar to the learned one (see Fig. 5), so they encode very similar kernels, though this will not always be the case depending on the task at hand.
4 Conclusion
We have introduced ‘general graph random features’ (g-GRFs), a novel random walk-based
algorithm for time-efficient estimation of arbitrary functions of a weighted adjacency matrix.
The mechanism is conceptually simple and trivially distributed across machines, unlocking
kernel-based machine learning on very large graphs. By parameterising one component of the random features with a simple neural network, we can further suppress the mean squared error of estimators and perform scalable implicit kernel learning.
Ethics: Our work is foundational. There are no direct ethical concerns that we can see,
though of course increases in scalability afforded by graph random features might amplify
risks of graph-based machine learning, from bad actors or as unintended consequences.
Reproducibility: To foster reproducibility, we clearly state the central algorithm in Alg. 1.
Source code is available at https://github.com/isaac-reid/general_graph_random_features.
All theoretical results are accompanied by proofs in Appendices A.1-A.4, where any assump-
tions are made clear. The datasets we use correspond to standard graphs and are freely
available online. We link suitable repositories in every instance. Except where prohibitively
computationally expensive, results are reported with uncertainties to help comparison.
EB initially proposed using a modulation function to generalise GRFs to estimate the dif-
fusion kernel and derived its mathematical expression. IR and KC then developed the full
g-GRFs algorithm for general functions of a weighted adjacency matrix, proving all theoret-
ical results, running the experiments and preparing the manuscript. AW provided helpful
discussions and advice.
IR acknowledges support from a Trinity College External Studentship. AW acknowledges
support from a Turing AI fellowship under grant EP/V025279/1 and the Leverhulme Trust
via CFI.
We thank Kenza Tazi and Austin Tripp for their careful readings of the manuscript. Richard
Turner provided valuable suggestions and support throughout the project.
References
Zhaojun Bai, James Demmel, Jack Dongarra, Axel Ruhe, and Henk van der Vorst. Templates
for the solution of algebraic eigenvalue problems: a practical guide. SIAM, 2000. URL
https://www.cs.ucdavis.edu/~bai/ET/contents.html.
Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk
bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482,
2002. URL https://dl.acm.org/doi/pdf/10.5555/944919.944944.
Karsten M Borgwardt, Cheng Soon Ong, Stefan Schönauer, SVN Vishwanathan, Alex J
Smola, and Hans-Peter Kriegel. Protein function prediction via graph kernels. Bioinfor-
matics, 21(suppl_1):i47–i56, 2005. URL https://doi.org/10.1093/bioinformatics/
bti1007.
Stéphane Canu and Alexander J. Smola. Kernel methods and the exponential fam-
ily. Neurocomputing, 69(7-9):714–720, 2006. doi: 10.1016/j.neucom.2005.12.009. URL
https://doi.org/10.1016/j.neucom.2005.12.009.
Olivier Chapelle, Jason Weston, and Bernhard Schölkopf. Cluster kernels for semi-supervised
learning. Advances in neural information processing systems, 15, 2002. URL https:
//dl.acm.org/doi/10.5555/2968618.2968693.
Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane,
Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al.
Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020. URL
https://doi.org/10.48550/arXiv.2009.14794.
Krzysztof M Choromanski, Mark Rowland, and Adrian Weller. The unreasonable effec-
tiveness of structured random orthogonal embeddings. Advances in neural information
processing systems, 30, 2017. URL https://doi.org/10.48550/arXiv.1703.00864.
Krzysztof Marcin Choromanski. Taming graph kernels with random features. In In-
ternational Conference on Machine Learning, pages 5964–5977. PMLR, 2023. URL
https://doi.org/10.48550/arXiv.2305.00156.
Fan R. K. Chung and Shing-Tung Yau. Coverings, heat kernels and spanning trees. Electron.
J. Comb., 6, 1999. doi: 10.37236/1444. URL https://doi.org/10.37236/1444.
Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for
learning kernels. In Proceedings of the 27th Annual International Conference on Machine
Learning (ICML 2010), 2010. URL http://www.cs.nyu.edu/~mohri/pub/lk.pdf.
Keenan Crane, Clarisse Weischedel, and Max Wardetzky. The heat method for distance
computation. Communications of the ACM, 60(11):90–99, 2017. URL https://dl.acm.
org/doi/10.1145/3131280.
Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson-Lindenstrauss transform. In Leonard J. Schulman, editor, Proceedings of the 42nd ACM Symposium on Theory of Computing, STOC 2010, Cambridge, Massachusetts, USA, 5-8 June 2010, pages 341–350. ACM, 2010. doi: 10.1145/1806689.1806737. URL https://doi.org/10.1145/1806689.1806737.
Michael Dawson-Haggerty. Trimesh repository, 2023. URL https://github.com/mikedh/
trimesh.
Inderjit S Dhillon, Yuqiang Guan, and Brian Kulis. Kernel k-means: spectral clustering
and normalized cuts. In Proceedings of the tenth ACM SIGKDD international conference
on Knowledge discovery and data mining, pages 551–556, 2004. URL https://dl.acm.
org/doi/10.1145/1014052.1014118.
Michel X. Goemans and David P. Williamson. Approximation algorithms for MAX-3-CUT and other problems via complex semidefinite programming. J. Comput. Syst. Sci., 68(2):442–470, 2004. doi: 10.1016/j.jcss.2003.07.012. URL https://doi.org/10.1016/j.jcss.2003.07.012.
Vladimir Ivashkin. Community graphs repository, 2023. URL https://github.com/
vlivashkin/community-graphs.
William B Johnson. Extensions of lipschitz mappings into a hilbert space. Contemp. Math.,
26:189–206, 1984. URL http://stanford.edu/class/cs114/readings/JL-Johnson.
pdf.
Kyle Kloster and David F Gleich. Heat kernel based community detection. In Proceedings
of the 20th ACM SIGKDD international conference on Knowledge discovery and data
mining, pages 1386–1395, 2014. URL https://doi.org/10.48550/arXiv.1403.3148.
Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding
the generalization error of combined classifiers. The Annals of Statistics, 30(1):1–50, 2002.
URL https://doi.org/10.48550/arXiv.math/0405343.
Risi Kondor and John D. Lafferty. Diffusion kernels on graphs and other discrete input
spaces. In Claude Sammut and Achim G. Hoffmann, editors, Machine Learning, Proceed-
ings of the Nineteenth International Conference (ICML 2002), University of New South
Wales, Sydney, Australia, July 8-12, 2002, pages 315–322. Morgan Kaufmann, 2002. URL
https://www.ml.cmu.edu/research/dap-papers/kondor-diffusion-kernels.pdf.
Leonid Kontorovich, Corinna Cortes, and Mehryar Mohri. Kernel methods for learning
languages. Theor. Comput. Sci., 405(3):223–236, 2008. doi: 10.1016/j.tcs.2008.06.037.
URL https://doi.org/10.1016/j.tcs.2008.06.037.
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances
in neural information processing systems, 20, 2007. URL https://dl.acm.org/doi/10.
5555/2981562.2981710.
Isaac Reid, Krzysztof Choromanski, and Adrian Weller. Quasi-monte carlo graph random
features. arXiv preprint arXiv:2305.12470, 2023. URL https://doi.org/10.48550/
arXiv.2305.12470.
Alexander J Smola and Risi Kondor. Kernels and regularization on graphs. In Learning The-
ory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel
Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceed-
ings, pages 144–158. Springer, 2003. URL https://people.cs.uchicago.edu/~risi/
papers/SmolaKondor.pdf.
Alexander J. Smola and Bernhard Schölkopf. Bayesian kernel methods. In Shahar Mendel-
son and Alexander J. Smola, editors, Advanced Lectures on Machine Learning, Machine
Learning Summer School 2002, Canberra, Australia, February 11-22, 2002, Revised Lec-
tures, volume 2600 of Lecture Notes in Computer Science, pages 65–117. Springer, 2002.
URL https://doi.org/10.1007/3-540-36434-X_3.
Yasutoshi Yajima. One-class support vector machines for recommendation tasks. In Pacific-
Asia Conference on Knowledge Discovery and Data Mining, pages 230–239. Springer,
2006. URL https://dl.acm.org/doi/10.1007/11731139_28.
Felix Xinnan X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N
Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. Advances in neural in-
formation processing systems, 29, 2016. URL https://doi.org/10.48550/arXiv.1610.
09072.
Yufan Zhou, Changyou Chen, and Jinhui Xu. Learning manifold implicitly via explicit heat-
kernel learning. Advances in Neural Information Processing Systems, 33:477–487, 2020.
URL https://doi.org/10.48550/arXiv.2010.01761.
A Appendices
A.1 Minimum batch size scales logarithmically with number of walks
Consider an ensemble of m random walks S := {ωi}_{i=1}^m whose lengths are sampled from a geometric distribution with termination probability p. Trivially, Pr(len(ωi) < b) = 1 − (1 − p)^b. Given m independent walkers,

    Pr(max_{ωi∈S} len(ωi) < b) = Pr(∩_{i=1}^m {len(ωi) < b}) = (1 − (1 − p)^b)^m.    (17)
Take ‘with high probability’ to mean with probability at least 1 − δ, with δ ≪ 1 fixed. Then
    b = log(1 − (1 − δ)^{1/m}) / log(1 − p) ≃ (log(δ) − log(m)) / log(1 − p).    (18)
As the number of walkers m grows, the minimum value of b to ensure that all walkers are
shorter than b with high probability scales logarithmically with m.
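The claim can be verified by simulation: draw m geometric walk lengths per trial and check that all of them fall below the b of Eq. 18 in at least a fraction ≈ 1 − δ of trials. A sketch (parameter values arbitrary):

```python
import math
import random

def walk_length(p, rng):
    """Number of steps before termination: Pr(len = k) = (1 - p)^k p."""
    length = 0
    while rng.random() >= p:   # survive each step with probability 1 - p
        length += 1
    return length

def trial_all_short(m, p, b, rng):
    """One ensemble of m walks; True iff every walk is shorter than b."""
    return all(walk_length(p, rng) < b for _ in range(m))

# b from Eq. 18 for m walks and failure probability delta
m, p, delta = 50, 0.2, 0.1
b = math.log(1.0 - (1.0 - delta) ** (1.0 / m)) / math.log(1.0 - p)
```

Rounding b up to an integer only increases the success probability, so the empirical rate should sit at or slightly above 1 − δ.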
A.2 Proof of Theorem 2.1 (unbiasedness of g-GRFs)

We begin by proving that g-GRFs constructed according to Alg. 1 give unbiased estimation of graph functions with Taylor coefficients (αi)_{i=0}^∞, provided the discrete convolution relation in Eq. 4 is fulfilled.
Denote the set of m walks sampled out of node i by {Ω̄k^(i)}_{k=1}^m. Carefully considering Alg. 1, it is straightforward to convince oneself that the g-GRF estimator takes the form

    ϕ(i)v := (1/m) Σ_{k=1}^{m} Σ_{ωiv ∈ Ωiv} ( ω̃(ωiv) f(len(ωiv)) / p(ωiv) ) I(ωiv ∈ Ω̄k^(i)).    (19)

Here, Ωiv denotes the set of all graph walks between nodes i and v, of which each ωiv is a member. len(ωiv) is a function that gives the length of walk ωiv, and ω̃(ωiv) evaluates the product of the edge weights it traverses. p(ωiv) = (1 − p)^{len(ωiv)} Π_{i=1}^{len(ωiv)} (1/di) denotes the walk’s marginal probability. I is the indicator function, which evaluates to 1 if its argument is true (namely, walk ωiv is a prefix subwalk of Ω̄k^(i), the kth walk sampled out of i) and 0 otherwise. Trivially, we have that E[I(ωiv ∈ Ω̄k^(i))] = p(ωiv) (by construction, to make the estimator unbiased).³
Now note that:

    E[ϕ(i)⊤ ϕ(j)] = E[ Σ_{v∈N} ϕ(i)v ϕ(j)v ]
                  = Σ_v Σ_{ωiv ∈ Ωiv} Σ_{ωjv ∈ Ωjv} ω̃(ωiv) f(len(ωiv)) ω̃(ωjv) f(len(ωjv))
                  = Σ_v Σ_{l1=0}^{∞} Σ_{l2=0}^{∞} [W^{l1}]_{iv} [W^{l2}]_{jv} f(l1) f(l2)
                  = Σ_v Σ_{l1=0}^{∞} Σ_{l3=0}^{l1} [W^{l1−l3}]_{iv} [W^{l3}]_{jv} f(l1 − l3) f(l3)
                  = Σ_{l1=0}^{∞} [W^{l1}]_{ij} Σ_{l3=0}^{l1} f(l1 − l3) f(l3).    (20)
³ To flag a subtle point: in the sum we take ωiv ∈ Ωiv to mean that ωiv is a member of the set of all possible walks of any length between nodes i and v, Ωiv. On the other hand, inside the indicator function by ωiv ∈ Ω̄k^(i) we mean that ωiv is a prefix subwalk of one particular walk Ω̄k^(i) sampled out of node i. In these two cases the interpretation of the symbol ∈ should be slightly different.
From the first to the second line we used the definition of GRFs in Eq. 19. We then rewrote
the sum of the product of edge weights over all possible paths as powers of the weighted
adjacency matrix W, with l1,2 corresponding to walk lengths. To get to the fourth line, we
changed indices in the infinite sums, then the final line followed simply.
The final expression in Eq. 20 is exactly equal to Kα(W) := Σ_{l1=0}^{∞} αl1 W^{l1} provided we have that

    Σ_{l3=0}^{l1} f(l1 − l3) f(l3) := αl1,    (21)

as stated in Eq. 4 of the main text (with variables renamed to k and p).
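The conclusion of the proof can be checked numerically: with truncated power series, (Σ_l f(l) W^l)(Σ_l f(l) W^l)⊤ should match Σ_k αk W^k whenever f ∗ f = α. A sketch for the diffusion case (toy matrix ours):

```python
import numpy as np
from math import factorial

def series(W, coeffs):
    """Truncated matrix power series sum_k coeffs[k] W^k."""
    N = W.shape[0]
    out, Wk = np.zeros((N, N)), np.eye(N)
    for c in coeffs:
        out += c * Wk
        Wk = Wk @ W
    return out

# symmetric toy adjacency matrix with small spectral radius
W = 0.1 * np.array([[0., 1., 1.], [1., 0., 1.], [1., 1., 0.]])
f = [1.0 / (2 ** l * factorial(l)) for l in range(30)]   # f * f gives 1/k!
alpha = [1.0 / factorial(k) for k in range(30)]
lhs = series(W, f) @ series(W, f).T                      # K_f(W) K_f(W)^T
rhs = series(W, alpha)                                   # K_alpha(W) = exp(W)
```

Here series(W, f) is a truncated exp(W/2), so the product reproduces exp(W) up to a negligible truncation error.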
A.3 Proof of Theorem 2.2 (computing symmetric modulation functions)

Here, we show how to compute f under the constraint that the modulation functions are identical, f1 = f2 = f. We will use the relationship in Eq. 4, reproduced below for the reader’s convenience:

    Σ_{p=0}^{i} f(i − p) f(p) = αi.    (22)
The iterative form in Eq. 6 is easy to show. For i = 0 we have that f(0)² = α0, so f(0) = ±√α0 = ±1 (where we used the normalisation condition α0 = 1). Now note that

    Σ_{p=0}^{i+1} f(i + 1 − p) f(p) = 2 f(0) f(i + 1) + Σ_{p=1}^{i} f(i + 1 − p) f(p) = αi+1,    (23)

so clearly

    f(i + 1) = ( αi+1 − Σ_{p=0}^{i−1} f(i − p) f(p + 1) ) / (2 f(0)).    (24)

This enables us to compute f(i + 1) given αi+1 and (f(p))_{p=0}^{i}.
The analytic form in Eq. 5 is only a little harder. Inserting the discrete convolution relation in Eq. 4 back into Eq. 2, we have that

    Kα(W) = Kf1(W) Kf2(W),    (25)

where Kf1(W) := Σ_{i=0}^∞ f1(i) W^i is the generating function corresponding to the sequence (f1(i))_{i=0}^∞. Constraining f1 = f2,

    Kf(W) = ± (Kα(W))^{1/2}.    (26)

We also discuss this in the ‘generating functions’ section of Sec. 2, where we use it to derive simple closed forms for f for some special kernels. Now we have that

    Σ_{i=0}^{∞} f(i) W^i = ± ( Σ_{n=0}^{∞} αn W^n )^{1/2}
                         = ± (1 + α1 W + α2 W² + ...)^{1/2}
                         = ± Σ_{n=0}^{∞} binom(1/2, n) (α1 W + α2 W² + ...)^n.    (27)

We need to equate powers of W between the generating functions. Consider the terms proportional to W^i. Clearly no such terms will feature when n > i, so we can restrict the sum on the RHS to 0 ≤ n ≤ i. Meanwhile, the term in (α1 W + α2 W² + ...)^n proportional to W^i is nothing other than

    Σ_{k1+2k2+3k3+...=i, k1+k2+k3+...=n} binom(n; k1, k2, k3, ...) α1^{k1} α2^{k2} α3^{k3} ...,    (28)
n
where k1 k2 k3 ... is the multinomial coefficient. Combining,
i 1
X X n
f (i) = ± 2 α1k1 α2k2 α3k3 ... (29)
n=0
n k1 +2k2 +3k3 ...=i
k1 k2 k3 ...
k1 +k2 +k3 +...=n
as shown in Eq. 5. Though this expression gives f purely in terms of α, the presence of the
conditional sum limits its practical utility compared to the iterative form.
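The square-root relation in Eqs. 25-26 is also easy to verify numerically. The following sketch (with an illustrative weight matrix, not from the paper's experiments) truncates both power series for the diffusion coefficients $\alpha_k = 1/k!$, for which $f(i) = (1/2)^i/i!$, and checks that $K_f(\mathbf{W})^2 \approx K_\alpha(\mathbf{W})$:

```python
import numpy as np
from math import factorial

def power_series(coeffs, W):
    """Truncated matrix power series sum_k coeffs[k] * W^k."""
    K, P = np.zeros_like(W), np.eye(W.shape[0])
    for c in coeffs:
        K += c * P
        P = P @ W
    return K

rng = np.random.default_rng(0)
A = rng.random((5, 5))
W = 0.1 * (A + A.T)  # small symmetric weight matrix with small spectral radius

T = 30  # truncation order; ample for this spectral radius
alpha = [1.0 / factorial(k) for k in range(T)]  # diffusion: K_alpha(W) = exp(W)
f = [0.5**k / factorial(k) for k in range(T)]   # its 'square root': K_f(W) = exp(W/2)

K_alpha = power_series(alpha, W)
K_f = power_series(f, W)
# K_f @ K_f reproduces K_alpha up to truncation error
```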
In this appendix we derive the bound on the empirical Rademacher complexity stated in
Theorem 2.3 and show the consequent generalisation bounds. The early stages closely follow
arguments made by Cortes et al. (2010). Recall that we have defined the hypothesis set
$$H = \{ x \to \mathbf{w}^\top \psi_{K_\alpha}(x) : |\alpha_i| \leq \alpha_i^{(M)}, \|\mathbf{w}\|_2 \leq 1 \}, \qquad (30)$$
with $\psi_{K_\alpha} : x \to \mathcal{H}_{K_\alpha}$ the feature mapping from the input space to the reproducing kernel
Hilbert space $\mathcal{H}_{K_\alpha}$ induced by the kernel $K_\alpha^{\mathcal{G}}$.
The empirical Rademacher complexity $\widehat{R}(H)$ for an arbitrary fixed sample $S = (x_i)_{i=1}^{m}$ is
defined by
$$\widehat{R}(H) := \mathbb{E}_\sigma \left[ \sup_{h \in H} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i) \right] \qquad (31)$$
where the expectation is taken over $\sigma = (\sigma_1, \ldots, \sigma_m)$, with $\sigma_i \sim \mathrm{Unif}(\pm 1)$ i.i.d. Rademacher
random variables.
Begin by noting that
$$h(x) := \mathbf{w}^\top \psi_{K_\alpha}(x) = \sum_{i=1}^{m} \beta_i K_\alpha(x_i, x) \qquad (32)$$
where $\beta_i$ are the coordinates of the orthogonal projection of $\mathbf{w}$ on $\mathcal{H}_S =
\mathrm{span}(\psi_{K_\alpha}(x_1), \ldots, \psi_{K_\alpha}(x_m))$, with $\boldsymbol{\beta}^\top \mathbf{K}_\alpha \boldsymbol{\beta} \leq 1$. Then we have that
$$\widehat{R}(H) = \frac{1}{m} \mathbb{E}_\sigma \left[ \sup_{\alpha, \beta} \boldsymbol{\sigma}^\top \mathbf{K}_\alpha \boldsymbol{\beta} \right]. \qquad (33)$$
The supremum $\sup_\beta \boldsymbol{\sigma}^\top \mathbf{K}_\alpha \boldsymbol{\beta}$ is reached when $\mathbf{K}_\alpha^{1/2} \boldsymbol{\beta}$ is collinear with $\mathbf{K}_\alpha^{1/2} \boldsymbol{\sigma}$, and making
$\|\boldsymbol{\beta}\|_2$ as large as possible gives
$$\widehat{R}(H) = \frac{1}{m} \mathbb{E}_\sigma \left[ \sup_\alpha \sqrt{\boldsymbol{\sigma}^\top \mathbf{K}_\alpha \boldsymbol{\sigma}} \right] \qquad (34)$$
as stated in Thm. 2.3. This bound is not tight for general graphs, but it is for specific
examples: for instance, when $\mathbf{W}$ is proportional to the identity, so that the only edges are
self-loops. Nonetheless, it provides some intuition for how the learned kernel's complexity
depends on $\mathbf{W}$.
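A Monte Carlo sketch of the quantity in Eq. 34 for a single fixed Gram matrix (so the supremum over $\alpha$ is trivial; the construction of the matrix and all sizes are illustrative, not from the paper). Since $\mathbb{E}[\boldsymbol{\sigma}^\top \mathbf{K} \boldsymbol{\sigma}] = \mathrm{tr}(\mathbf{K})$ for Rademacher $\boldsymbol{\sigma}$, Jensen's inequality supplies the sanity check $\widehat{R}(H) \leq \sqrt{\mathrm{tr}(\mathbf{K})}/m$:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20
B = rng.standard_normal((m, m))
K = B @ B.T  # an arbitrary positive semidefinite Gram matrix

n_samples = 5000
sigma = rng.choice([-1.0, 1.0], size=(n_samples, m))  # Rademacher vectors
proj = sigma @ B                    # K = B B^T, so sigma^T K sigma = ||B^T sigma||^2
quad = (proj ** 2).sum(axis=1)      # one quadratic form per sample, exactly >= 0
rademacher_est = np.sqrt(quad).mean() / m  # Monte Carlo estimate of Eq. 34
```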
As stated in the main text, Eq. 37 immediately yields generalisation bounds for learning
kernels. Again closely following Cortes et al. (2010), consider the application of binary
classification where nodes are assigned a label $y_i = \pm 1$. Denote by $R(h)$ the true
generalisation error of $h \in H$,
$$R(h) = \Pr(y h(x) < 0). \qquad (38)$$
Consider a training sample $S = ((x_i, y_i))_{i=1}^{m}$, and define the $\rho$-empirical margin loss by
$$\widehat{R}_\rho(h) := \frac{1}{m} \sum_{i=1}^{m} \min\left(1, \left[1 - y_i h(x_i)/\rho\right]_+\right) \qquad (39)$$
where $\rho > 0$. For any $\delta > 0$, with probability at least $1 - \delta$, the following bound holds for
any $h \in H$ (Bartlett and Mendelson, 2002; Koltchinskii and Panchenko, 2002):
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \widehat{R}(H) + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}}. \qquad (40)$$
Inserting the bound on the empirical Rademacher complexity in Eq. 37, we immediately
have that
$$R(h) \leq \widehat{R}_\rho(h) + \frac{2}{\rho} \sqrt{\frac{1}{m} \sum_{i=0}^{\infty} \alpha_i^{(M)} \rho(\mathbf{W})^i} + 3 \sqrt{\frac{\log \frac{2}{\delta}}{2m}} \qquad (41)$$
which shows how the generalisation error can be controlled via $\alpha^{(M)}$ or $\rho(\mathbf{W})$.
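As an illustrative sketch (hypothetical sizes, margin and coefficients, not the paper's experiments), the middle term of Eq. 41 can be evaluated directly. For diffusion-style coefficients $\alpha_k^{(M)} = 1/k!$ the series is simply $\exp(\rho(\mathbf{W}))$, which the test below exploits:

```python
import numpy as np
from math import factorial, sqrt

rng = np.random.default_rng(1)
A = rng.random((50, 50))
W = 0.05 * (A + A.T)                              # regularised symmetric weights
spec_rad = np.max(np.abs(np.linalg.eigvalsh(W)))  # rho(W), the spectral radius

m = 50     # number of training nodes (illustrative)
margin = 1.0  # the margin parameter rho of Eq. 41 (illustrative)
T = 40     # series truncation; ample for this spectral radius
alpha_max = [1.0 / factorial(k) for k in range(T)]  # alpha_i^(M), diffusion-style
series = sum(a * spec_rad**k for k, a in enumerate(alpha_max))  # ~ exp(rho(W))

complexity_term = (2.0 / margin) * sqrt(series / m)  # middle term of Eq. 41
```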
Recalling from Eq. 1 that the modulation functions $(f_1, f_2)$ do not necessarily need to be
equal for unbiased estimation of some kernel $K_\alpha$, a natural extension of Sec. 3.4 is to
introduce two separate neural modulation functions $f_{1,2}^{(N)}$ and train them both following
the scheme in Sec. 3.4. Intriguingly, even with an initialisation where $f_2^{(N)}$ encodes 'lazy'
behaviour (deposits almost all its load at the starting node – see Sec. 2) and $f_1^{(N)}$ is flat,
upon training the neural modulation functions quickly become very similar (though not
identical). See Fig. 6. We use the same optimisation hyperparameters and network
architectures as in Sec. 3.4. Rigorously proving the best possible choice of $(f_1, f_2)$,
including whether e.g. a symmetric pair is optimal, is left as an exciting open theoretical
problem. We note that, whilst parameterising two separate neural modulation functions gives
a more general mechanism (with the symmetric pair as a special case), it also doubles the
number of parameters required, slowing training and evaluation.

Figure 6: Modulation functions $(f_1, f_2)$ parameterised by separate neural networks before
and after training to target the 2-regularised Laplacian kernel. (Plot of $f(i)$ against $i$,
showing the initial and trained $f_1$ and $f_2$.)
Here we further investigate the behaviour of the g-GRFs kernel approximation as the
properties of the graph change. To do this, we generate random graphs using the Erdős-Rényi
model, where each of the $\binom{N}{2}$ possible edges is present independently with probability
$p_\mathrm{edge}$ or absent with probability $1 - p_\mathrm{edge}$.
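A minimal sketch of this graph generator (function and variable names are ours):

```python
import numpy as np

def erdos_renyi_adjacency(n, p_edge, seed=0):
    """Symmetric 0/1 adjacency matrix: each of the n-choose-2 possible
    edges is present independently with probability p_edge."""
    rng = np.random.default_rng(seed)
    upper = np.triu(rng.random((n, n)) < p_edge, k=1)  # sample each pair once
    return (upper + upper.T).astype(float)             # mirror into both triangles

A = erdos_renyi_adjacency(100, 0.5)
```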
Firstly, we investigate how the approximation error varies with graph sparsity. We take
a fixed number of nodes ($N = 100$) and control the sparsity by varying $p_\mathrm{edge}$ between
0.1 and 0.9. We then approximate the diffusion kernel using g-GRFs with a termination
probability $p_\mathrm{halt} = 0.1$ and 8 walkers. For each graph we compute the relative Frobenius
norm error between the true ($\mathbf{K}$) and approximated ($\widehat{\mathbf{K}}$) kernels: namely, $\|\mathbf{K} - \widehat{\mathbf{K}}\|_F / \|\mathbf{K}\|_F$.
We repeat 100 times to obtain the standard deviation of the mean error estimate. Fig. 7
shows the results; approximation quality degrades slightly as pedge grows, but remains very
good throughout (error < 0.00275).
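The error metric is easy to reproduce. In this sketch we use a truncated power series as a cheap stand-in for the g-GRF estimator (which would require the full random-walk machinery of Sec. 3), and compute $\exp(\mathbf{W})$ exactly by eigendecomposition; all parameter choices are illustrative:

```python
import numpy as np
from math import factorial

def relative_frobenius_error(K_true, K_approx):
    """||K - K_hat||_F / ||K||_F, the metric reported in this section."""
    return np.linalg.norm(K_true - K_approx) / np.linalg.norm(K_true)

def diffusion_kernel(M):
    """exp(M) for symmetric M, via eigendecomposition."""
    lam, V = np.linalg.eigh(M)
    return (V * np.exp(lam)) @ V.T

def truncated_diffusion(M, T):
    """Stand-in approximation: first T terms of exp(M) = sum_k M^k / k!."""
    K, P = np.zeros_like(M), np.eye(M.shape[0])
    for k in range(T):
        K += P / factorial(k)
        P = P @ M
    return K

rng = np.random.default_rng(0)
upper = np.triu(rng.random((100, 100)) < 0.5, k=1)
W = 0.02 * (upper + upper.T).astype(float)  # regularised ER graph, p_edge = 0.5
err = relative_frobenius_error(diffusion_kernel(W), truncated_diffusion(W, 10))
```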
Next, we investigate the scalability of the method by comparing the time for exact and
approximate evaluation of the diffusion kernel as the size of the graph grows. We fix $p_\mathrm{edge} =
0.5$ and vary the number of nodes $N$ between 100 and 12800. For every graph, we measure
the wall-clock time for i) exact computation of $\mathbf{K}$ by computing the matrix exponential and
ii) approximate computation of $\widehat{\mathbf{K}}$ using g-GRFs (8 walkers, $p_\mathrm{halt} = 0.5$, regulariser $\sigma = 0.5$).
Fig. 8 shows the results. Naturally, the exact method scales worse, becoming slower for
graphs bigger than a few thousand nodes, and by the largest graph g-GRFs are already
faster by a factor of 7.8. We also measure the relative Frobenius norm error between the
true and approximated Gram matrices to check that the quality of estimation remains good.
This is shown by the green line, which indeed remains almost constant at a very small
value ($\simeq 0.005$).
Figure 7: Approximation error vs. edge-generation probability for the diffusion kernel on a
graph of size N = 100. The quality remains good as sparsity changes, varying within a
narrow range.

Figure 8: Wall-clock time for exact and approximate kernel evaluation as the number of
nodes varies. g-GRFs scale better and are faster with a few thousand nodes. The
approximation error remains low as N grows.
In this short section, we provide further experimental details and discussion to supplement
the main text.
1. Choice of $p_\mathrm{halt}$: The termination probability encodes a simple trade-off between
   approximation quality and speed; if $p_\mathrm{halt}$ is lower, walks tend to last longer and
   sample more subwalks, so they give a better approximation of graph kernels. In
   practice, any reasonably small value works well, as also reported for the original GRFs
   mechanism (Choromanski, 2023). In experiments we typically choose $p_\mathrm{halt} \sim 0.1$.
2. Choice of $\sigma$ and kernel convergence: After Eq. 2 we noted that, when
   approximating some fixed kernel $K_\alpha(\mathbf{W}) = \sum_{k=0}^{\infty} \alpha_k \mathbf{W}^k$, we assume that the sum
   does not diverge. It would not be possible to construct a finite random feature
   estimator if this were not the case. We require that $\sum_{k=0}^{\infty} \alpha_k \rho(\mathbf{W})^k$ is finite, with
   $\rho(\mathbf{W}) := \max_{\lambda \in \Lambda(\mathbf{W})}(|\lambda|)$ the spectral radius of the weighted adjacency matrix
   ($\Lambda(\mathbf{W})$ is the set of eigenvalues of $\mathbf{W}$). When considering e.g. the diffusion kernel
   in the literature, this is ensured by multiplying $\mathbf{W}$ by some regulariser $\sigma \in \mathbb{R}_+$,
   taking e.g. $\mathbf{W} \to \sigma \mathbf{W}$ whereupon $\lambda_i \to \sigma \lambda_i \; \forall \; i = 1, \ldots, N$. This is the reason for the
   extra parameters $\{\sigma, \alpha\}$ in the kernel expressions in Table 1. As we noted in Sec. 3.5,
   this is exactly equivalent to transforming the modulation function $f(i) \to f(i) \sigma^i \; \forall \; i$.
   Where space allows, we report the exact choice of $\sigma$ with each experiment (though
   empirically it does not modify our conclusions). In Sec. 3.5 we take the weighted
   adjacency matrix $\mathbf{W} := [a_{ij} / \sqrt{d_i d_j}]_{i,j=1}^{N}$ with $a_{ij} = 1$ if node $i$ is connected to
   node $j$ and $0$ otherwise, and $d_i$ the degree of node $i$. We then regularise by taking
   $\mathbf{W} \to \sigma \mathbf{W}$ with $\sigma = 0.025$, which is small enough for convergence even with the
   largest graph considered. We use $m = 16$ walkers and $p_\mathrm{halt} = 0.5$.
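The claimed equivalence between $\mathbf{W} \to \sigma\mathbf{W}$ and $f(i) \to f(i)\sigma^i$ follows immediately from $\sum_i f(i)(\sigma\mathbf{W})^i = \sum_i (f(i)\sigma^i)\mathbf{W}^i$, and can be checked numerically; the weight matrix and truncation below are illustrative:

```python
import numpy as np
from math import factorial

def series(fvals, W):
    """Truncated power series sum_i fvals[i] * W^i."""
    K, P = np.zeros_like(W), np.eye(W.shape[0])
    for c in fvals:
        K += c * P
        P = P @ W
    return K

rng = np.random.default_rng(0)
A = rng.random((6, 6))
W = A + A.T                                     # unregularised symmetric weights
sigma = 0.025
f = [0.5**i / factorial(i) for i in range(15)]  # diffusion modulation function

lhs = series(f, sigma * W)                                  # regularise the weights...
rhs = series([fi * sigma**i for i, fi in enumerate(f)], W)  # ...or modulate f instead
```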