Word2vec, Node2vec, Graph2vec, X2vec - Towards A Theory of Vector Embeddings of Structured Data
Vector embeddings can bridge the gap between the "discrete" world of relational data and the "differentiable" world of machine learning and for this reason have a great potential for database research. Yet relatively little work has been done on embeddings of relational data beyond the binary relations of knowledge graphs. Throughout the paper, I will try to point out potential directions for database related research questions on vector embeddings.

A vector embedding for a class X of objects is a mapping f from X into some vector space, called the latent space, which we usually assume to be a real vector space R^d of finite dimension d. The idea is to define a vector embedding in such a way that geometric relationships in the latent space reflect semantic relationships between the objects in X. Most importantly, we want similar objects in X to be mapped to vectors close to one another with respect to some standard metric on the latent space (say, Euclidean). For example, in an embedding of words of a natural language we want words with similar meanings, like "shoe" and "boot", to be mapped to vectors that are close to each other. Sometimes, we want further-reaching correspondences between properties of and relations between objects in X and the geometry of their images in latent space. For example, in an embedding f of the entities of a knowledge base, among them Paris, France, Santiago, Chile, we may want t := f(Paris) − f(France) to be (approximately) equal to f(Santiago) − f(Chile), so that the relation is-capital-of corresponds to the translation by the vector t in latent space.

A difficulty is that the semantic relationships and similarities between the objects in X can rarely be quantified precisely. They usually only have an intuitive meaning that, moreover, may be application dependent. However, this is not necessarily a problem, because we can learn vector representations in such a way that they yield good results when we use them to solve machine learning tasks (so-called downstream tasks). This way, we never have to make the semantic relationships explicit. As a simple example, we may use a nearest-neighbour based classification algorithm on the vectors our embedding gives us; if it performs well then the distance between vectors must be relevant for this classification task. This way, we can even use vector embeddings, trained to perform well on certain machine learning tasks, to define semantically meaningful distance measures on our original objects, that is, to define the distance dist_f(X, Y) between objects X, Y ∈ X to be ∥f(X) − f(Y)∥. We call dist_f the distance measure induced by the embedding f.

In this paper, the objects X ∈ X we want to embed either are graphs, possibly labelled or weighted, or more generally relational structures, or they are nodes of a (presumably large) graph or more generally elements or tuples appearing in a relational structure. When we embed entire graphs or structures, we speak of graph embeddings or relational structure embeddings; when we embed only nodes or elements we speak of node embeddings. These two types of embeddings are related, but there are clear differences. Most importantly, in node embeddings there are explicit relations such as adjacency and derived relations such as distance between the objects of X (the nodes of a graph), whereas in graph embeddings all relations between objects are implicit or "semantic", for example "having the same number of vertices" or "having the same girth" (see Figure 1).

The key theoretical questions we will ask about vector embeddings of objects in X are the following.

Expressivity: Which properties of objects X ∈ X are represented by the embedding? What is the meaning of the induced distance measure? Are there geometric properties of the latent space that represent meaningful relations on X?

Complexity: What is the computational cost of computing the vector embedding? What are efficient embedding algorithms? How can we efficiently retrieve semantic information of the embedded data, for example, answer queries?

A third question that relates to both expressivity and complexity is what dimension to choose for the latent space. In general, we expect a trade-off between (high) expressivity and (low) dimension, but it may well be that there is an inherent dimension of the data set. It is an appealing idea (see, for example, [98]) to think of "natural" data sets appearing in practice as lying on a low dimensional manifold in high dimensional space. Then we can regard the dimension of this manifold as the inherent dimension of the data set.

Reasonably well-understood from a theoretical point of view are node embeddings of graphs that aim to preserve distances between nodes, that is, embeddings f : V(G) → R^d of the vertex set V(G) of some graph G such that dist_G(x, y) ≈ ∥f(x) − f(y)∥, where dist_G is the shortest-path distance in G. There is a substantial theory of such metric embeddings (see [64]). In many applications of node embeddings, metric embeddings are indeed what we need.

However, the metric is only one aspect of the information carried by a graph or relational structure, and arguably not the most important one from a database perspective. Moreover, if we consider graph embeddings rather than node embeddings, there is no metric to start with. In this paper, we are concerned with structural vector embeddings of graphs, relational structures, and their nodes. Two theoretical ideas that have been shown to help in understanding and even designing vector embeddings of structures are the Weisfeiler-Leman algorithm and various concepts in its context, and homomorphism vectors, which can be seen as a general framework for defining "structural" (as opposed to "metric") embeddings. We will see that these theoretical concepts have a rich theory that connects them to the embedding techniques used in practice in various ways.

The rest of the paper is organised as follows. Section 2 is a very brief survey of some of the embedding techniques that can be found in the machine learning and knowledge representation literature. In Section 3, we introduce the Weisfeiler-Leman algorithm. This algorithm, originally a graph isomorphism test, turns out to be an important link between the embedding techniques described in Section 2 and the theory of homomorphism vectors, which will be discussed in detail in Section 4. Finally, Section 5 is devoted to a discussion of similarity measures for graphs and structures.

2 EMBEDDING TECHNIQUES
In this section, we give a brief and selective overview of embedding techniques. More thorough recent surveys are [50] (on node embeddings), [104] (on graph neural networks), [102] (on knowledge graph embeddings), and [61] (on graph kernels).
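Before turning to specific techniques, here is a minimal Python sketch (with made-up toy data) of the induced distance measure dist_f and the nearest-neighbour idea from the introduction; the embedding f is simply assumed to be given as a function returning vectors.

```python
import numpy as np

def induced_distance(f, x, y):
    """dist_f(x, y) := ||f(x) - f(y)||, the distance induced by an embedding f."""
    return float(np.linalg.norm(f(x) - f(y)))

def nearest_neighbour_label(f, labelled_objects, query):
    """Classify `query` by the label of its nearest neighbour under dist_f."""
    return min(labelled_objects, key=lambda ol: induced_distance(f, ol[0], query))[1]

# Toy example: a hypothetical word embedding given as a lookup table.
toy_vectors = {"shoe": np.array([0.9, 0.1]), "boot": np.array([0.85, 0.2]),
               "apple": np.array([0.1, 0.95])}
f = lambda w: toy_vectors[w]
print(induced_distance(f, "shoe", "boot"))   # small: similar meanings
print(nearest_neighbour_label(f, [("shoe", "clothing"), ("apple", "fruit")], "boot"))
```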
value obtained from the neighbours and the current state of the node as inputs and computes the new state of the node. In a simple form, we may take the following functions:

    Aggregate:  a_v^{(t+1)} ← W_agg · Σ_{w ∈ N(v)} x_w^{(t)},          (2.1)

    Update:     x_v^{(t+1)} ← σ( W_up · (x_v^{(t)} ; a_v^{(t+1)}) ),    (2.2)

where W_agg ∈ R^{c×d} and W_up ∈ R^{d×(c+d)} are learned parameter matrices, (x_v^{(t)} ; a_v^{(t+1)}) denotes the concatenation of the two vectors, and σ is a nonlinear "activation" function, for example the ReLU (rectified linear unit) function σ(x) := max{0, x} applied pointwise to a vector. It is important to note that the parameter matrices W_agg and W_up do not depend on the node v; they are shared across all nodes of a graph. This parameter sharing allows us to use the same GNN model for graphs of arbitrary sizes.

Of course, we can also use more complicated aggregation and update functions. We only want these functions to be differentiable to be able to use gradient descent optimisation methods in the training phase, and we want the aggregation function to be symmetric in its arguments x_w^{(t)} for w ∈ N(v) to make sure that the GNN computes a function that is isomorphism invariant. For example, in [100] we use a linear aggregation function and an update function computed by an LSTM (long short-term memory, [51]), a specific recurrent neural network component that allows it to "remember" relevant information from the sequence x_v^{(0)}, x_v^{(1)}, ..., x_v^{(t)}.

The computation of such a GNN model starts from an initial configuration (x_v^{(0)})_{v∈V} and proceeds through a fixed number t of aggregation and update steps, resulting in a final configuration (x_v^{(t)})_{v∈V}. Note that this configuration gives us a node embedding v ↦ x_v^{(t)} of the input graph. We can also stack several such GNN layers, each with its own aggregation and activation function, on top of one another, using the final configuration of each (but the last) layer as the initial configuration of the following layer and the final configuration of the last layer as the node embedding. As initial states, we can take constant vectors like the all-ones vector for each node, or we can assign a random initial state to each node. We can also use the initial state to represent the node labels if the input graph is labelled.

To train a GNN for computing a node embedding, in principle we can use any of the loss functions used by the embedding techniques described in Section 2.1. The reader may wonder what advantage the complicated GNN architecture has over just optimising the embedding matrix X (as the methods described in Section 2.1 do). The main advantage is that the GNN method is inductive, whereas the previously described methods are transductive. This means that a GNN represents a function that we can apply to arbitrary graphs, not just to the graph it was originally trained on. So if the graph changes over time and, for example, nodes are added, we do not have to re-train the embedding, but just embed the new nodes using the GNN model we already have, which is much more efficient. We can even apply the model to an entirely new graph and still hope it gives us a reasonable embedding. The most prominent example of an inductive node-embedding tool based on GNNs is GraphSage [49].

Let me close this section by remarking that GNNs are used for all kinds of machine learning tasks on graphs and not only to compute node embeddings. For example, a GNN based architecture for graph classification would plug the output of the GNN layer(s) into a standard feedforward network (possibly consisting only of a single softmax layer).

2.3 Knowledge Graph and Relational Structure Embeddings
Node embeddings of knowledge graphs have also been studied quite intensely in recent years, remarkably by a community that seems almost disjoint from that involved in the node embedding techniques described in Section 2.1. What makes knowledge graphs somewhat special is that they come with labelled edges (or, equivalently, many different binary relations) as well as labelled nodes. It is not completely straightforward to adapt the methods of Section 2.1 to edge- and vertex-labelled graphs. Another important difference is in the objective function: the methods of Section 2.1 mainly focus on the graph metric (even though approaches based on random walks like node2vec are flexible and also incorporate structural criteria). However, shortest-path distance is less relevant in knowledge graphs.

Rather than focussing on distances, knowledge graph embeddings focus on establishing a correspondence between the relations of the knowledge graph and geometric relationships in the latent space. A very influential algorithm, TransE [18], aims to associate a specific translation of the latent space with each relation. Recall the example of the introduction, where entities Paris, France, Santiago, Chile were supposed to be embedded in such a way that x_Paris − x_France ≈ x_Santiago − x_Chile, so that the relation is-capital-of corresponds to the translation by t := x_Paris − x_France.

Another way of mapping relations to geometric relationships is implemented in Rescal [83]. Here the idea is to associate a bilinear form β_R with each relation R in such a way that for all entities v, w it holds that β_R(x_v, x_w) ≈ 1 if (v, w) ∈ R and β_R(x_v, x_w) ≈ 0 if (v, w) ∉ R. We can represent such a bilinear form β_R by a matrix B_R such that β_R(x, y) = x^⊤ B_R y. Then the objective is to minimise, simultaneously for all R, the term ∥X B_R X^⊤ − A_R∥, where X is the embedding matrix with rows x_v and A_R is the adjacency matrix of the relation R. Note that this is a multi-relational version of the matrix-factorisation approach described in Section 2.1, with the additional twist that we also need to find the matrix B_R for each relation R.

Completing our remarks on knowledge graph embeddings, we mention that it is fairly straightforward to generalise the GNN based node embeddings to (vertex- and edge-)labelled graphs and hence to knowledge graphs [91].
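As an illustration of the two scoring ideas just described, here is a small, hypothetical Python sketch (toy dimension and randomly initialised parameters, not the original systems): a TransE-style translation score and a Rescal-style bilinear score.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # latent dimension (toy choice)
entities = ["Paris", "France", "Santiago", "Chile"]
x = {e: rng.normal(size=d) for e in entities}   # entity embeddings x_v
t = rng.normal(size=d)                   # TransE: translation for is-capital-of
B = rng.normal(size=(d, d))              # Rescal: matrix B_R of the bilinear form

def transe_score(v, w):
    """TransE-style: small if x_v - x_w is close to the relation translation t,
    as in the Paris/France example (t := x_Paris - x_France)."""
    return float(np.linalg.norm(x[v] - x[w] - t))

def rescal_score(v, w):
    """Rescal-style: beta_R(x_v, x_w) = x_v^T B_R x_w, ideally close to 1 for facts."""
    return float(x[v] @ B @ x[w])

# In training, one would adjust x, t and B so that facts such as
# (Paris, is-capital-of, France) score well and non-facts do not.
print(transe_score("Paris", "France"), rescal_score("Paris", "France"))
```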
While there is a large body of work on embedding knowledge graphs, that is, binary relational structures, not much is known about embedding relations of higher arities. Of course one approach to embedding relational structures of higher arities is to transform them into their binary incidence structures (see Section 4.2 for a definition) and then embed these using any of the methods available for binary structures. Currently, I am not aware of any empirical studies on the practical viability of this approach. An alternative approach [16, 17] is based on the idea of treating the rows of a table, that is, tuples in a relation, like sentences in natural language and then using word embeddings to embed the entities.

techniques discussed before, they only play a minor role. For graph embeddings, we are in the opposite situation: kernels are the dominant technique. However, there are a few other approaches.
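Before moving on to the Weisfeiler-Leman algorithm, here is a minimal NumPy sketch of the simple GNN layer given by equations (2.1) and (2.2) in Section 2.2; the weight matrices are random stand-ins for learned parameters, and the graph is a toy example.

```python
import numpy as np

def gnn_layer(adj, X, W_agg, W_up):
    """One round of the simple GNN layer (2.1)/(2.2):
    a_v = W_agg @ (sum of neighbour states), x_v' = relu(W_up @ [x_v ; a_v])."""
    neighbour_sum = adj @ X                          # row v = sum_{w in N(v)} x_w
    Agg = neighbour_sum @ W_agg.T                    # a_v^{(t+1)}, shape (n, c)
    Z = np.concatenate([X, Agg], axis=1) @ W_up.T    # W_up applied to [x_v ; a_v]
    return np.maximum(Z, 0.0)                        # pointwise ReLU

# Toy example: a path on 4 vertices, constant all-ones initial states.
adj = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
n, d, c = 4, 3, 2
rng = np.random.default_rng(1)
X = np.ones((n, d))
W_agg, W_up = rng.normal(size=(c, d)), rng.normal(size=(d, c + d))
for _ in range(3):                                   # three aggregation/update rounds
    X = gnn_layer(adj, X, W_agg, W_up)
print(X)                                             # node embedding v -> x_v^{(3)}
```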
1-WL
Input: Graph G
Refinement Round: For all colours c in the current colouring and all nodes v, w of colour c, the nodes v and w get different colours in the new colouring if there is some colour d such that v and w have different numbers of neighbours of colour d.
The refinement is repeated until the colouring is stable, then the stable colouring is returned.
Algorithm 1: The 1-dimensional WL algorithm

Figure 3: A run of 1-WL ((a) initial graph, (b) colouring after round 1, (c) colouring after round 2, (d) stable colouring after round 3)

the two graphs. Unfortunately, 1-WL does not distinguish all non-isomorphic graphs. For example, it does not distinguish a cycle of length 6 from the disjoint union of two triangles. But, remarkably, 1-WL does distinguish almost all graphs, in a precise probabilistic sense [8].

3.2 Variants of 1-WL
The version of 1-WL we have formulated is designed for undirected graphs. For directed graphs it is better to consider in-neighbours and out-neighbours of nodes separately. 1-WL can easily be adapted to labelled graphs. If vertex labels are present, they can be incorporated in the initial colouring: two vertices get the same initial colour if and only if they have the same label(s). We can incorporate edge labels in the refinement rounds: two nodes v and w get different colours in the new colouring if there is some colour d and some edge label λ such that v and w have a different number of λ-neighbours of colour d.

However, if the edge labels are real numbers, which we interpret as edge weights, or more generally elements of an arbitrary commutative monoid, then we can also use the following weighted version of 1-WL due to [44]. Instead of refining by the number of edges into some colour, we refine by the sum of the edge weights into that colour. Thus the refinement round of Algorithm 1 is modified as follows: for all colours c in the current colouring and all nodes v, w of colour c, v and w get different colours in the new colouring if there is some colour d such that Σ_{x of colour d} α(v, x) ≠ Σ_{x of colour d} α(w, x), where α(x, y) denotes the weight of the edge from x to y, and we set α(x, y) = 0 if there is no edge from x to y. This idea also allows us to define 1-WL on matrices: with a matrix A ∈ R^{m×n} we associate a weighted bipartite graph with vertex set {v_1, ..., v_m, w_1, ..., w_n} and edge weights α(v_i, w_j) := A_ij and α(v_i, v_i′) = α(w_j, w_j′) = 0, and we run weighted 1-WL on this weighted graph with an initial colouring that distinguishes the v_i (rows) from the w_j (columns). An example is shown in Figure 4. This matrix version of WL was applied in [44] to design a dimension reduction technique that speeds up the solving of linear programs with many symmetries (or regularities).

Figure 4: Stable colouring of a matrix and the corresponding weighted bipartite graph computed by matrix WL

3.3 Higher-Dimensional WL
For this paper, the 1-dimensional version of the Weisfeiler-Leman algorithm is the most relevant, but let us briefly describe the higher-dimensional versions. In fact, it is the 2-dimensional version, also referred to as classical WL, that was introduced by Weisfeiler and Leman [103] in 1968 and gave the algorithm its name. The k-dimensional Weisfeiler-Leman algorithm (k-WL) is based on the same iterative-refinement idea as 1-WL. However, instead of vertices, k-WL colours k-tuples of vertices of a graph. Initially, each k-tuple is "coloured" by the isomorphism type of the subgraph it induces. Then in the refinement rounds, the colour information is propagated between "adjacent" tuples that only differ in one coordinate (details can be found in [24]). If implemented using similar ideas as for 1-WL, k-WL runs in time O(n^{k+1} log n) [53].

Higher-dimensional WL is much more powerful than 1-WL, but Cai, Fürer, and Immerman [24] proved that for every k there are non-isomorphic graphs G_k, H_k that are not distinguished by k-WL. These graphs, known as the CFI graphs, have size O(k) and are 3-regular.

DeepWL, a WL version of unlimited dimension that can distinguish the CFI graphs in polynomial time, was recently introduced in [47].
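The following is a compact Python sketch of 1-WL colour refinement (Algorithm 1), recording for every round the multiset of colours; these per-round colour counts are exactly the numbers wl(c, G) used by the kernels in Section 3.5 below. It is only illustrative and not tuned for efficiency.

```python
from collections import Counter

def wl_colour_counts(adj, rounds):
    """1-WL colour refinement on a graph given as an adjacency list {v: [neighbours]}.
    Returns, for each round i = 0..rounds, a Counter mapping colour -> number of
    vertices with that colour (colours are represented canonically as nested tuples)."""
    colour = {v: () for v in adj}                      # round 0: constant colouring
    history = [Counter(colour.values())]
    for _ in range(rounds):
        colour = {v: (colour[v], tuple(sorted(colour[w] for w in adj[v])))
                  for v in adj}                        # refine by multiset of neighbour colours
        history.append(Counter(colour.values()))
    return history

# Toy example: 1-WL does not distinguish C6 from two disjoint triangles.
c6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(wl_colour_counts(c6, 2) == wl_colour_counts(two_triangles, 2))   # True
```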
3.4 Logical and Algebraic Characterisations
The beauty of the WL algorithm lies in the fact that its expressiveness has several natural and completely unrelated characterisations. Of these, we will see two in this section. Later, we will see two more characterisations in terms of GNNs and homomorphism numbers.

The logic C is the extension of first-order logic by counting quantifiers of the form ∃^{≥p} x ("there exist at least p elements x"). Every C-formula is equivalent to a formula of plain first-order logic. However, here we are interested in fragments of C obtained by restricting the number of variables of formulas, and the translation from C to first-order logic may increase the number of variables. For every k ≥ 1, by C^k we denote the fragment of C consisting of all formulas with at most k (free or bound) variables. The finite variable logics C^k play an important role in finite model theory (see, for example, [40]). Cai, Fürer, and Immerman [24] have related these fragments to the WL algorithm.

Theorem 3.1 ([24]). Two graphs are C^{k+1}-equivalent, that is, they satisfy the same sentences of the logic C^{k+1}, if and only if k-WL does not distinguish the graphs.

Let us now turn to an algebraic characterisation of WL. Our starting point is the observation that two graphs G, H with vertex sets V, W and adjacency matrices A ∈ R^{V×V}, B ∈ R^{W×W} are isomorphic if and only if there is a permutation matrix X ∈ R^{V×W} such that X^⊤AX = B. Recall that a permutation matrix is a {0, 1}-matrix that has exactly one 1-entry in each row and in each column. Since permutation matrices are orthogonal (i.e., they satisfy X^⊤ = X^{−1}), we can rewrite this as AX = XB, which has the advantage of being linear. This corresponds to the following linear equations in the variables X_vw, for v ∈ V and w ∈ W:

    Σ_{v′∈V} A_{vv′} X_{v′w} = Σ_{w′∈W} X_{vw′} B_{w′w}   for all v ∈ V, w ∈ W.   (3.2)

We can add equations expressing that the row and column sums of the matrix X are 1, which implies that X is a permutation matrix if the X_vw are nonnegative integers:

    Σ_{w′∈W} X_{vw′} = Σ_{v′∈V} X_{v′w} = 1   for all v ∈ V, w ∈ W.   (3.3)

Obviously, equations (3.2) and (3.3) have a nonnegative integer solution if and only if the graphs G and H are isomorphic. This does not help much from an algorithmic point of view, because it is NP-hard to decide if a system of linear equations and inequalities has an integer solution. But what about nonnegative rational solutions? We know that we can compute them in polynomial time. A nonnegative rational solution to (3.2) and (3.3), which can also be seen as a doubly stochastic matrix satisfying AX = XB, is called a fractional isomorphism between G and H. If such a fractional isomorphism exists, we say that G and H are fractionally isomorphic. Tinhofer [99] proved the following theorem.

Theorem 3.2 ([99]). Graphs G and H are fractionally isomorphic if and only if 1-WL does not distinguish G and H.

A corresponding theorem also holds for the weighted and the matrix version of 1-WL [44]. Moreover, Atserias and Maneva [5] proved a generalisation that relates k-WL to the level-k Sherali-Adams relaxation of the system of equations and thus yields an algebraic characterisation of k-WL indistinguishability (also see [45, 71] and [6, 13, 39, 84] for related algebraic aspects of WL).

Figure 5: Viewing colours of WL as trees (the colours C0, C1, C2 of successive rounds, drawn as rooted trees, for a graph G)

Note that to decide whether two graphs G, H with adjacency matrices A, B are fractionally isomorphic, we can minimise the convex function ∥AX − XB∥_F, where X ranges over the convex set of doubly stochastic matrices. To minimise this function, we can use standard gradient descent techniques for convex minimisation. It was shown in [57] that, surprisingly, the refinement rounds of 1-WL closely correspond to the iterations of the Frank-Wolfe convex minimisation algorithm.

3.5 Weisfeiler-Leman Graph Kernels
The WL algorithm collects local structure information and propagates it along the edges of a graph. We can define very effective graph kernels based on this local information. For every i ≥ 0, let C_i be the set of colours that 1-WL assigns to the vertices of a graph in the i-th round. Figure 5 illustrates that we can identify the colours in C_i with rooted trees of height i. For every graph G and every colour c ∈ C_i, by wl(c, G) we denote the number of vertices that receive colour c in the i-th round of 1-WL.

Example 3.3. For the graph G shown in Figure 5, we have wl(c, G) = 2 and wl(c′, G) = 0 for two of the rooted-tree colours c, c′ depicted in the original figure.

For every t ∈ N, the t-round WL-kernel is the mapping K_WL^{(t)} defined by

    K_WL^{(t)}(G, H) := Σ_{i=0}^{t} Σ_{c∈C_i} wl(c, G) · wl(c, H)

for all graphs G, H. It is easy to see that this mapping is symmetric and positive-semidefinite and thus indeed a kernel mapping; the corresponding vector embedding maps each graph G to the vector

    ( wl(c, G) : c ∈ ⋃_{i=0}^{t} C_i ).

Note that formally, we are mapping G to an infinite dimensional vector space, because all the sets C_i for i ≥ 1 are infinite. However, for a graph G of order n the vector has at most tn + 1 nonzero entries. We can also define a version K_WL of the WL-kernel that does not depend on a fixed number of rounds by letting

    K_WL(G, H) := Σ_{i≥0} (1/2^i) Σ_{c∈C_i} wl(c, G) · wl(c, H).

The WL-kernel was introduced by Shervashidze et al. [94] under the name Weisfeiler-Leman subtree kernel. They also introduce variants such as a Weisfeiler-Leman shortest path kernel.
A great advantage the WL (subtree) kernel has over most of the graph kernels discussed in Section 2.4 is its efficiency, while performing at least as well as other kernels on downstream tasks. Shervashidze et al. [94] report that in practice, t = 5 is a good number of rounds for the t-round WL-kernel.

There are also graph kernels based on the higher-dimensional WL algorithm [76].

3.6 Weisfeiler-Leman and GNNs
Recall that a GNN computes a sequence (x_v^{(t)})_{v∈V}, for t ≥ 0, of vector embeddings of a graph G = (V, E). In the most general form, it is recursively defined by

    x_v^{(t+1)} = f_UP( x_v^{(t)}, f_AGG( (x_w^{(t)} : w ∈ N(v)) ) ),

where the aggregation function f_AGG is symmetric in its arguments. It has been observed in several places [49, 78, 106] that this is very similar to the update process of 1-WL. Indeed, it is easy to see that if the initial embedding x_v^{(0)} is constant then for any two vertices v, w, if 1-WL assigns the same colour to v and w then x_v^{(t)} = x_w^{(t)}. This implies that two graphs that cannot be distinguished by 1-WL will give the same result for any GNN applied to them; that is, GNNs are at most as expressive as 1-WL. It is shown in [78] that a converse of this holds as well, even if the aggregation and update functions of the GNN are of a very simple form (like (2.1) and (2.2) in Section 2.2). Based on the connection between WL and logic, a more refined analysis of the expressiveness of GNNs was carried out in [10]. However, the limitations of the expressiveness only hold if the initial embedding x_v^{(0)} is constant (or at least constant on all 1-WL colour classes). We can increase the expressiveness of GNNs by assigning random initial vectors x_v^{(0)} to the vertices. The price we pay for this increased expressiveness is that the output of a run of the GNN model is no longer isomorphism invariant. However, the whole randomised process is still isomorphism invariant. More formally, the random variable that associates an output (x_v)_{v∈V} with each graph G is isomorphism invariant.

A fully invariant way to increase the expressiveness of GNNs is to build "higher-dimensional" GNNs, inspired by the higher-dimensional WL algorithm. Instead of nodes of the graphs, they operate on constant sized tuples or sets of vertices. A flexible architecture for such higher-dimensional GNNs is proposed in [78].

4 COUNTING HOMOMORPHISMS
Most of the graph kernels and also some of the node embedding techniques are based on counting occurrences of substructures like walks, cycles, or trees. There are different ways of embedding substructures into a graph. For example, walks and paths are the same structures, but we allow repeated vertices in a walk. Formally, "walks" are homomorphic images of path graphs, whereas "paths" are embedded path graphs. It turns out that homomorphisms and homomorphic images give us a very robust and flexible "basis" for counting all kinds of substructures [30].

A homomorphism from a graph F to a graph G is a mapping h from the nodes of F to the nodes of G such that for all edges uu′ of F the image h(u)h(u′) is an edge of G. On labelled graphs, homomorphisms have to preserve vertex and edge labels, and on directed graphs they have to preserve the edge direction. Of course we can generalise homomorphisms to arbitrary relational structures, and we remind the reader of the close connection between homomorphisms and conjunctive queries. We denote the number of homomorphisms from F to G by hom(F, G).

Example 4.1. For the graph G shown in Figure 5, we have hom(F, G) = 18 and hom(F′, G) = 114, where F, F′ are the two small stars depicted in the original. To calculate these numbers, we observe that for the star S_k (the tree of height 1 with k leaves) we have hom(S_k, G) = Σ_{v∈V(G)} deg_G(v)^k.

For every class F of graphs, the homomorphism counts hom(F, G) give a graph embedding Hom_F defined by

    Hom_F(G) := ( hom(F, G) : F ∈ F )

for all graphs G. If F is infinite, the latent space R^F of the embedding Hom_F is an infinite dimensional vector space. By suitably scaling the infinite series involved, we can define an inner product on a subspace H_F of R^F that includes the range of Hom_F. This also gives us a graph kernel. One way of making this precise is as follows. For every k, we let F_k be the set of all F ∈ F of order |F| := |V(F)| = k. Then we let

    K_F(G, H) := Σ_{k=1}^{∞} (1 / (|F_k| k^k)) Σ_{F∈F_k} hom(F, G) · hom(F, H).   (4.1)

There are various other ways of doing this, for example, rather than looking at the sum over all F ∈ F_k we may look at the maximum. In practice, one will simply cut off the infinite series and only consider a finite subset of F. A problem with using homomorphism vectors as graph embeddings is that the homomorphism numbers quickly get tremendously large. In practice, we take logarithms of these numbers, possibly scaled by the size of the graphs from F. So, a practically reasonable graph embedding based on homomorphism vectors would take a finite class F of graphs and map each G to the vector

    ( (1/|F|) log hom(F, G) : F ∈ F ).

Initial experiments show that this graph embedding performs very well on downstream classification tasks even if we take F to be a small class (of size 20) of graphs consisting of binary trees and cycles. This is a good indication that homomorphism vectors extract relevant features from a graph. Note that the size of the class F is the dimension of the feature space.

Apart from these practical considerations, homomorphism vectors have a beautiful theory that links them to various natural notions of similarity between structures, including indistinguishability by the Weisfeiler-Leman algorithm.

4.1 Homomorphism Indistinguishability
Two graphs G and H are homomorphism-indistinguishable over a class F of graphs if Hom_F(G) = Hom_F(H). Lovász proved that homomorphism indistinguishability over the class G of all graphs corresponds to isomorphism.

Theorem 4.2 ([65]). For all graphs G and H,
    Hom_G(G) = Hom_G(H) ⟺ G and H are isomorphic.
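To make the homomorphism vectors Hom_F concrete, here is a small brute-force Python sketch of hom(F, G) and of the log-scaled finite embedding just described. Brute force is exponential in |V(F)| and only meant to illustrate the definitions; for classes of bounded tree width, hom(F, G) can be computed in polynomial time (cf. Section 4.3).

```python
import math
from itertools import product

def hom(F_vertices, F_edges, G_adj):
    """Number of homomorphisms from F to G (brute force over all vertex maps)."""
    count = 0
    for images in product(list(G_adj), repeat=len(F_vertices)):
        h = dict(zip(F_vertices, images))
        if all(h[v] in G_adj[h[u]] for u, v in F_edges):   # each edge of F lands on an edge of G
            count += 1
    return count

def hom_embedding(patterns, G_adj):
    """Log-scaled homomorphism vector ((1/|F|) log hom(F, G) : F in a finite pattern class).
    The max(.., 1) is only a practical guard against log(0) for patterns with no homomorphisms."""
    return [math.log(max(hom(V, E, G_adj), 1)) / len(V) for (V, E) in patterns]

# Toy example: patterns are a single edge and a triangle; G is a 4-cycle.
edge = ([0, 1], [(0, 1)])
triangle = ([0, 1, 2], [(0, 1), (1, 2), (0, 2)])
c4 = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(hom(*edge, c4), hom(*triangle, c4))       # 8 and 0
print(hom_embedding([edge, triangle], c4))
```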
While the proof of Theorem 4.4 relies on techniques similar to the proof of Theorem 4.2, the proof of Theorem 4.3 is based on spectral techniques.

Example 4.7. For the co-spectral graphs G, H shown in Figure 6, we have hom(F, G) = 20 and hom(F, H) = 16 for the small path F depicted in the original. Thus Hom_P(G) ≠ Hom_P(H).

Example 4.8. Figure 7 shows graphs G, H with Hom_P(G) = Hom_P(H). Obviously, 1-WL distinguishes the two graphs. Thus Hom_T(G) ≠ Hom_T(H). It can also be checked that the graphs are not co-spectral. Hence Hom_C(G) ≠ Hom_C(H).

Combined with Theorem 3.1, Theorem 4.4 implies the following correspondence between homomorphism counts of graphs of bounded tree width and the finite variable fragments of the counting logic C introduced in Section 3.4.

Corollary 4.9. For all graphs G and H and all k ≥ 1,
    Hom_{Tk}(G) = Hom_{Tk}(H) ⟺ G and H are C^{k+1}-equivalent.

This is interesting, because it shows that the homomorphism vector Hom_{Tk}(G) gives us all the information necessary to answer queries expressed in the logic C^{k+1}. Unfortunately, the result does not tell us how to answer C^{k+1}-queries algorithmically if we have access to the vector Hom_{Tk}(G). To make the question precise, suppose we have oracle access that allows us to obtain, for every graph F ∈ Tk, the entry hom(F, G) of the homomorphism vector. Is it possible to answer a C^{k+1}-query in polynomial time (either with respect to data complexity or combined complexity)?

Arguably, from a logical perspective it is even more natural to restrict the quantifier rank (maximum number of nested quantifiers) in a formula rather than the number of variables. Let C_k be the fragment of C consisting of all formulas of quantifier rank at most k. We obtain the following characterisation of C_k-equivalence in terms of homomorphism vectors over the class of graphs of tree depth at most k. Tree depth, introduced by Nešetřil and Ossona de Mendez [81], is another structural graph parameter that has received a lot of attention in recent years (e.g. [9, 23, 28, 34, 35]).

Theorem 4.10 ([42]). For all graphs G and H and all k ≥ 1,
    Hom_{TDk}(G) = Hom_{TDk}(H) ⟺ G and H are C_k-equivalent.

Here TDk denotes the class of all graphs of tree depth at most k.

4.2 Beyond Undirected Graphs
So far, we have only considered homomorphism indistinguishability on undirected graphs. A few results are known for directed graphs. In particular, Theorem 4.2 directly extends to directed graphs. Actually, we have the following stronger result, also due to Lovász [66] (also see [14]).

Theorem 4.11 ([66]). For all directed graphs G and H,
    Hom_DA(G) = Hom_DA(H) ⟺ G and H are isomorphic.

Here DA denotes the class of all directed acyclic graphs.

It is straightforward to extend Theorem 4.2, Theorem 4.4 (for the natural generalisation of the WL algorithms to relational structures), and Theorem 4.10 to arbitrary relational structures. This is very useful for binary relational structures such as knowledge graphs. But for relations of higher arity one may consider another version based on the incidence graph of a structure.

Let σ = {R_1, ..., R_m} be a relational vocabulary, where R_i is a k_i-ary relation symbol. Let k be the maximum of the k_i. We let σ_I := {E_1, ..., E_k, P_1, ..., P_m}, where the E_j are binary and the P_i are unary relation symbols. With every σ-structure A = (V(A), R_1(A), ..., R_m(A)) we associate a σ_I-structure A_I, called the incidence structure of A, as follows:
• the universe of A_I is V(A_I) := V(A) ∪ ⋃_{i=1}^{m} { (R_i, v_1, ..., v_{k_i}) : (v_1, ..., v_{k_i}) ∈ R_i(A) };
• for 1 ≤ j ≤ k, the relation E_j(A_I) consists of all pairs ( v_j, (R_i, v_1, ..., v_{k_i}) ) for (R_i, v_1, ..., v_{k_i}) ∈ V(A_I) with k_i ≥ j;
• for 1 ≤ i ≤ m, the relation P_i(A_I) consists of all (R_i, v_1, ..., v_{k_i}) ∈ V(A_I).

With this encoding of general structures as binary incidence structures we obtain the following corollary.

Corollary 4.12. For all σ-structures A and B, the following are equivalent.
(1) Hom_{T(σI)}(A_I) = Hom_{T(σI)}(B_I), where T(σ_I) denotes the class of all σ_I-structures whose underlying (Gaifman) graph is a tree;
(2) A_I and B_I are not distinguished by 1-WL;
(3) A_I and B_I are C^2-equivalent.

Böker [14] gave a generalisation of Theorem 4.4 to hypergraphs that is also based on incidence graphs.
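The incidence-structure encoding is easy to implement; the following is a small Python sketch under the representation assumptions noted in the comments (relations given as a dictionary mapping relation names to sets of tuples).

```python
def incidence_structure(universe, relations):
    """Build the incidence structure A_I of a relational structure A.
    `relations` maps a relation name R to the set of tuples in R(A).
    Returns (universe of A_I, binary relations E_1..E_k, unary relations P_R)."""
    tuple_elems = {(R, t) for R, tuples in relations.items() for t in tuples}
    universe_I = set(universe) | tuple_elems
    max_arity = max((len(t) for ts in relations.values() for t in ts), default=0)
    E = {j: {(t[j - 1], (R, t)) for (R, t) in tuple_elems if len(t) >= j}
         for j in range(1, max_arity + 1)}             # E_j links the j-th entry to the tuple
    P = {R: {(R, t) for t in tuples} for R, tuples in relations.items()}
    return universe_I, E, P

# Toy ternary structure: one relation R = {(a, b, c), (a, c, d)}.
U, E, P = incidence_structure({"a", "b", "c", "d"},
                              {"R": {("a", "b", "c"), ("a", "c", "d")}})
print(sorted(E[1]))    # "a" is the first entry of both R-tuples
```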
In a different direction, we can generalise the results to weighted graphs. Let us consider undirected graphs with real-valued edge weights. We can also view them as symmetric matrices over the reals. Recall that we denote the weight of an edge uv by α(u, v) and that weighted 1-WL refines by sums of edge weights (instead of numbers of edges). Let F be an unweighted graph and G a weighted graph. For every mapping h : V(F) → V(G) we let

    wt(h) := Π_{uu′∈E(F)} α(h(u), h(u′)).

As α(v, v′) = 0 if and only if vv′ ∉ E(G), we have wt(h) ≠ 0 if and only if h is a homomorphism from F to G; the weight of this homomorphism is the product of the weights of the edges in its image. We let

    hom(F, G) := Σ_{h : V(F)→V(G)} wt(h).

In statistical physics, such sum-product functions are known as partition functions. For a class F of graphs, we let Hom_F(G) := ( hom(F, G) : F ∈ F ).

Theorem 4.13 ([22]). For all weighted graphs G and H, the following are equivalent.
(1) Hom_T(G) = Hom_T(H) (recall that T denotes the class of all trees);
(2) G and H are not distinguished by weighted 1-WL;
(3) equations (3.2) and (3.3) have a nonnegative rational solution.

4.3 Complexity
In general, counting the number of homomorphisms from a graph F to a graph G is a #P-hard problem. Dalmau and Jonsson [31] proved that, under the reasonable complexity theoretic assumption #W[1] ≠ FPT from parameterised complexity theory, for all classes F of graphs, computing hom(F, G), given a graph F ∈ F and an arbitrary graph G, is in polynomial time if and only if F has bounded tree width. This makes Theorem 4.4 even more interesting, because the entries of a homomorphism vector Hom_F(G) are computable in polynomial time precisely for bounded tree width classes.

However, the computational problem we are facing is not to compute individual entries of a homomorphism vector, but to decide if two graphs have the same homomorphism vector, that is, if they are homomorphism indistinguishable. The characterisation theorems of Section 4.1 imply that homomorphism indistinguishability is polynomial-time decidable over the classes P of paths, C of cycles, T of trees, Tk of graphs of tree width at most k, and TDk of tree depth at most k. Moreover, homomorphism indistinguishability over the class G of all graphs is decidable in quasi-polynomial time by Babai's [7] celebrated result that graph isomorphism is decidable in quasi-polynomial time. It was proved in [15] that homomorphism indistinguishability over the class of complete graphs is complete for the complexity class C=P, which implies that it is co-NP hard, and that there is a polynomial time decidable class F of graphs of bounded tree width such that homomorphism indistinguishability over F is undecidable. Quite surprisingly, the fact that quantum isomorphism is undecidable [4] implies that homomorphism indistinguishability over the class of planar graphs is undecidable.

vectors to define node embeddings. A rooted graph is a pair (G, v) where G is a graph and v ∈ V(G). For two rooted graphs (F, u) and (G, v), by hom(F, G; u ↦ v) we denote the number of homomorphisms h from F to G with h(u) = v. For a class F* of rooted graphs and a rooted graph (G, v), we let

    Hom_{F*}(G, v) := ( hom(F, G; u ↦ v) : (F, u) ∈ F* ).

If we keep the graph G fixed, this gives us an embedding of the nodes of G into an infinite dimensional vector space. Note that in the terminology of Section 2.1, this embedding is "inductive" and not "transductive", because it is not tied to a fixed graph. (Nevertheless, the term "inductive" is not fitting very well here, because the embedding is not learned.) In the same way we defined graph kernels based on homomorphism vectors of graphs, we can now define node kernels.

It is straightforward to generalise Theorem 4.2 to rooted graphs, showing that for all rooted graphs (G, v) and (H, w) it holds that

    Hom_{G*}(G, v) = Hom_{G*}(H, w) ⟺ there is an isomorphism f from G to H with f(v) = w.

Here G* denotes the class of all rooted graphs. Maybe the easiest way to prove this is by a reduction to node-labelled graphs.

Another key result of Section 4.1 that can be adapted to the node setting is Theorem 4.4. We only state the version for trees.

Theorem 4.14. For all graphs G, H and all v ∈ V(G), w ∈ V(H), the following are equivalent.
(1) Hom_{T*}(G, v) = Hom_{T*}(H, w) for the class T* of all rooted trees;
(2) 1-WL assigns the same colour to v and w.

This result is implicit in the proof of Theorem 4.4 (see [32, 33]). In fact, it can be viewed as the graph theoretic core of the proof. We sketch the proof here and also show how to derive Theorem 4.4 (for trees) from it.

Proof sketch. Recall from Section 3.5 that we can view the colours assigned by 1-WL as rooted trees (see Figure 5). For the k-th round of WL, this is a tree of height k, and for the stable colouring we can view it as an infinite tree. Suppose now that the colour of a vertex v of G is T. The crucial observation is that for every rooted tree (S, r) the number hom(S, G; r ↦ v) is precisely the number of mappings h : V(S) → V(T) that map the root r of S to the root of T and, for each node s ∈ V(S), map the children of s to the children of h(s) in T. Let us call such mappings rooted tree homomorphisms. Implication (2) ⟹ (1) follows directly from this observation. Implication (1) ⟹ (2) follows as well, because by an argument similar to that used in the proof of Theorem 4.2 it can be shown that for distinct rooted trees T, T′ there is a rooted tree that has distinct numbers of rooted tree homomorphisms to T, T′. □

Corollary 4.15. For all graphs G, H and all v ∈ V(G), w ∈ V(H), the following are equivalent.
(1) Hom_{T*}(G, v) = Hom_{T*}(H, w);
(2) for all formulas φ(x) of the logic C^2, G ⊨ φ(v) if and only if H ⊨ φ(w).
Section 2.1. They are solely based on structural properties and ignore the distance information. Results like Corollary 4.15 show that the structural information captured by the homomorphism-based embeddings in principle enables us to answer queries directly on the embedding, which may be more useful than distance information in database applications.

We close this section by sketching how Theorem 4.4 (for trees) follows from Theorem 4.14.

Proof sketch of Theorem 4.4 (for trees). Let G, H be graphs. We need to prove

    Hom_T(G) = Hom_T(H) ⟺ 1-WL does not distinguish G and H.   (4.4)

Without loss of generality we assume that V(G) ∩ V(H) = ∅. For x, y ∈ V(G) ∪ V(H), we write x ∼ y if 1-WL assigns the same colour to x and y. By Theorem 4.14, for X, Y ∈ {G, H} and x ∈ V(X), y ∈ V(Y) we have x ∼ y if and only if Hom_{T*}(X, x) = Hom_{T*}(Y, y). Let R_1, ..., R_n be the ∼-equivalence classes, and for every j, let P_j := R_j ∩ V(G) and Q_j := R_j ∩ V(H). Furthermore, let p_j := |P_j| and q_j := |Q_j|.

We first prove the backward direction of (4.4). Assume that 1-WL does not distinguish G and H. Then p_j = q_j for all j ∈ [n]. Let T be a tree, and let t ∈ V(T). Let h_j := hom(T, X; t ↦ x) for x ∈ R_j and X ∈ {G, H} with x ∈ V(X). Then

    hom(T, G) = Σ_{v∈V(G)} hom(T, G; t ↦ v) = Σ_{j=1}^{n} p_j h_j = Σ_{j=1}^{n} q_j h_j = Σ_{w∈V(H)} hom(T, H; t ↦ w) = hom(T, H).

Since T was arbitrary, this proves Hom_T(G) = Hom_T(H).

The proof of the forward direction of (4.4) is more complicated. Assume Hom_T(G) = Hom_T(H). There is a finite collection of m ≤ (n choose 2) rooted trees (T_1, r_1), ..., (T_m, r_m) such that for all X, Y ∈ {G, H} and x ∈ V(X), y ∈ V(Y) we have x ∼ y if and only if for all i ∈ [m],

    hom(T_i, X; r_i ↦ x) = hom(T_i, Y; r_i ↦ y).

Let a_ij := hom(T_i, X; r_i ↦ x) for x ∈ R_j and X ∈ {G, H} with x ∈ V(X). Then for all i we have

    Σ_{j=1}^{n} a_ij p_j = hom(T_i, G) = hom(T_i, H) = Σ_{j=1}^{n} a_ij q_j.

Unfortunately, the matrix A = (a_ij)_{i∈[m], j∈[n]} is not necessarily invertible, so we cannot directly conclude that p_j = q_j for all j. All we know is that for any two columns of the matrix there is a row such that the two columns have distinct values in that row. It turns out that this is sufficient. For every vector d = (d_1, ..., d_m) of nonnegative integers, let (T^{(d)}, r^{(d)}) be the rooted tree obtained by taking the disjoint union of d_i copies of T_i for all i and then identifying the roots of all these trees. It is easy to see that

    hom(T^{(d)}, X; r^{(d)} ↦ x) = Π_{i=1}^{m} hom(T_i, X; r_i ↦ x)^{d_i}.

Thus, letting a_j^{(d)} := Π_{i=1}^{m} a_ij^{d_i}, we have

    Σ_{j=1}^{n} a_j^{(d)} p_j = hom(T^{(d)}, G) = hom(T^{(d)}, H) = Σ_{j=1}^{n} a_j^{(d)} q_j.

Using these additional equations, it can be shown that p_j = q_j for all j (see [43, Lemma 4.2]). Thus WL does not distinguish G and H. □

4.5 Homomorphisms and GNNs
We have a correspondence between homomorphism vectors and the Weisfeiler-Leman algorithm (Theorems 4.4 and 4.14) and between the WL algorithm and GNNs (see Section 3.6). This also establishes a correspondence between homomorphism vectors and GNNs. More directly, the correspondence between GNNs and homomorphism counts is also studied in [69].

5 SIMILARITY
The results described in the previous section can be interpreted as results on the expressiveness of homomorphism-based embeddings of structures and their nodes. However, all these results only show what it means that two objects are mapped to the same homomorphism vector. More interesting is the similarity measure the vector embeddings induce via some inner product or norm on the latent space (see (4.1)). We can speculate that, given the nice results regarding equality of vectors, the similarity measure will have similarly nice properties. Let me propose the following, admittedly vague, hypothesis.

    For suitable classes F, the homomorphism embedding Hom_F combined with a suitable inner product on the latent space induces a natural similarity measure on graphs or relational structures.

From a practical perspective, we could support this hypothesis by showing that the vector embeddings give good results when combined with similarity based downstream tasks. As mentioned earlier, initial experiments show that homomorphism vectors in combination with support vector machines perform well on standard graph classification benchmarks. But a more thorough experimental study will be required to have conclusive results.

From a theoretical perspective, we can compare the homomorphism-based similarity measures with other similarity measures for graphs and discrete structures. If we can prove that they coincide or are close to each other, then this would support our hypothesis.

5.1 Similarity from Matrix Norms
A standard way of defining similarity measures on graphs is based on comparing their adjacency matrices. Let us briefly review a few matrix norms. Recall the standard ℓp-vector norm ∥x∥_p := (Σ_i |x_i|^p)^{1/p}; note that ∥x∥_2 is just the Euclidean norm, which we denoted by ∥x∥ earlier in this paper. The two best-known matrix norms are the Frobenius norm ∥M∥_F := sqrt(Σ_{i,j} M_ij^2) and the spectral norm ∥M∥_⟨2⟩ :=
sup_{x∈R^n, ∥x∥_2=1} ∥Mx∥_2. More generally, for every p > 0 we define

    ∥M∥_p := ( Σ_{i,j} |M_ij|^p )^{1/p}

(so ∥M∥_F = ∥M∥_2) and the cut norm

    ∥M∥_□ := max_{S,T} | Σ_{i∈S, j∈T} M_ij |,

where S, T range over all subsets of the index set of the matrix. Observe that for M ∈ R^{n×n} we have

    ∥M∥_□ ≤ ∥M∥_1 ≤ n ∥M∥_F,

where the second inequality follows from the Cauchy-Schwarz inequality. If we compare matrices of different size, it can be reasonable to scale the norms by a factor depending on n.

For technical reasons, we only consider matrix norms ∥·∥ that are invariant under permutations of the rows and columns, that is,

    ∥M∥ = ∥MP∥ = ∥QM∥ for all permutation matrices P, Q.   (5.1)

It is easy to see that the norms discussed above have this property.

Now let G, H be graphs with vertex sets V, W and adjacency matrices A ∈ R^{V×V}, B ∈ R^{W×W}. For convenience, let us assume that |G| = |H| =: n. Then both A, B are n × n-matrices, and we can compare them using a matrix norm. However, it does not make much sense to just consider ∥A − B∥, because graphs do not have a unique adjacency matrix, and even if G and H are isomorphic, ∥A − B∥ may be large. Therefore, we align the two matrices in an optimal way by permuting the rows and columns of A. For a matrix norm ∥·∥, we define a graph distance measure dist_∥·∥ by

    dist_∥·∥(G, H) := min_{P ∈ {0,1}^{V×W} permutation matrix} ∥P^⊤AP − B∥.

It follows from (5.1) that dist_∥·∥ is well-defined, that is, does not depend on the choice of the particular adjacency matrices A, B. It also follows from (5.1) and the fact that P^{−1} = P^⊤ for permutation matrices that

    dist_∥·∥(G, H) = min_{P ∈ {0,1}^{V×W} permutation matrix} ∥AP − PB∥,   (5.2)

which is often easier to work with because the expression AP − PB is linear in the "variables" P_ij. To simplify the notation, we let dist_p := dist_{∥·∥_p} and dist_⟨p⟩ := dist_{∥·∥_⟨p⟩} for all p, and we let dist_□ := dist_{∥·∥_□}.

The distances defined from the ℓ1-norm have natural interpretations as edit distances. dist_1(G, H) is twice the number of edges that need to be flipped to turn G into a graph isomorphic to H, and dist_⟨1⟩(G, H) is the maximum number of edges incident with a single vertex we need to flip to turn G into a graph isomorphic to H. Formally,

    dist_1(G, H) = 2 · min_{f : V→W bijection} | { f(v)f(v′) : vv′ ∈ E(G) △ E(H) } |,   (5.3)

and dist_⟨1⟩(G, H) can be written analogously, taking the maximum over single vertices in place of the total count.

Despite these intuitive interpretations, it is debatable how much "semantic relevance" these distance measures have. How similar are two graphs that can be transformed into each other by flipping, say, 5% of the edges? Again, the answer to this question may depend on the application context.

A big disadvantage the graph distance measures based on matrix norms have is that computationally they are highly intractable (see, for example, [3] and the references therein). It is even NP-hard to compute the distance between two trees (see [46] for Frobenius distance and [38] for the distances based on operator norms), and the distances are hard to approximate. The problem of computing these distances is related to the maximisation version of the quadratic assignment problem (see [70, 79]), a notoriously hard combinatorial optimisation problem. Better behaved is the cut-distance dist_□; at least it can be approximated within a factor of 2 [2].

The main source of hardness is the minimisation over the unwieldy set of all permutations (or permutation matrices). To alleviate this hardness, we can relax the integrality constraints and minimise over the convex set of all doubly stochastic matrices instead. That is, we define a relaxed distance measure

    dist̃_∥·∥(G, H) := min_{X ∈ [0,1]^{V×W} doubly stochastic} ∥AX − XB∥.   (5.5)

Note that dist̃_∥·∥ is only a pseudo-metric: the distance between non-isomorphic graphs may be 0. Indeed, it follows from Theorem 3.2 that dist̃_∥·∥(G, H) = 0 if and only if G and H are fractionally isomorphic. The advantage of these "relaxed" distances is that for many norms ∥·∥, computing dist̃_∥·∥ is a convex minimisation problem that can be solved efficiently.

So far, we have only discussed distance measures for graphs of the same order. To extend these distance measures to arbitrary graphs, we can replace vertices by sets of identical vertices in both graphs to obtain two graphs whose order is the least common multiple of the orders of the two initial graphs (see [67, Section 8.1] for details).

Note that these matrix based similarity measures are only defined for (possibly weighted) graphs. In particular for the operator norms, it is not clear how to generalise them to relational structures, and if such a generalisation would even be meaningful.
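As an illustration of the relaxed distance (5.5) with the Frobenius norm, here is a hedged Python sketch using the Frank-Wolfe method mentioned in Section 3.4; the linear subproblem over doubly stochastic matrices is solved with an assignment solver, and the iteration count and toy graphs are arbitrary choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def relaxed_distance(A, B, iterations=100):
    """Approximately minimise ||AX - XB||_F over doubly stochastic X (Frank-Wolfe
    with exact line search). The minimum is 0 iff the graphs are fractionally isomorphic."""
    n = A.shape[0]
    X = np.full((n, n), 1.0 / n)                      # centre of the Birkhoff polytope
    for _ in range(iterations):
        R = A @ X - X @ B
        grad = 2 * (A.T @ R - R @ B.T)                # gradient of ||AX - XB||_F^2
        rows, cols = linear_sum_assignment(grad)      # vertex of the polytope minimising
        S = np.zeros_like(X)                          # the linearised objective
        S[rows, cols] = 1.0
        D = S - X
        RD = A @ D - D @ B
        denom = np.vdot(RD, RD)
        gamma = 0.0 if denom == 0 else np.clip(-np.vdot(R, RD) / denom, 0.0, 1.0)
        X += gamma * D                                # exact line search step
    return float(np.linalg.norm(A @ X - X @ B))

C6 = np.array([[1.0 if (i - j) % 6 in (1, 5) else 0.0 for j in range(6)] for i in range(6)])
two_triangles = np.kron(np.eye(2), np.ones((3, 3)) - np.eye(3))
path6 = np.diag(np.ones(5), 1) + np.diag(np.ones(5), -1)
print(relaxed_distance(C6, two_triangles))   # ~0: fractionally isomorphic (Theorem 3.2)
print(relaxed_distance(C6, path6))           # > 0: 1-WL distinguishes a cycle from a path
```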
5.2 Comparing Homomorphism Distances and [5] A. Atserias and E. Maneva. 2013. Sherali–Adams Relaxations and Indistinguisha-
bility in Counting Logics. SIAM J. Comput. 42, 1 (2013), 112–137.
Matrix Distances [6] A. Atserias and J. Ochremiak. 2018. Definable Ellipsoid Method, Sums-of-
It would be very nice if we could establish a connection between Squares Proofs, and the Isomorphism Problem. In Proceedings of the 33rd Annual
ACM/IEEE Symposium on Logic in Computer Science. 66–75.
graph distance measures based on homomorphism vectors and [7] L. Babai. 2016. Graph Isomorphism in Quasipolynomial Time. In Proceedings of
those based on matrix norms. At least one important result in this the 48th Annual ACM Symposium on Theory of Computing (STOC ’16). 684–697.
[8] L. Babai, P. Erdös, and S. Selkow. 1980. Random graph isomorphism. SIAM J.
direction exists: Lovász [67] proves an equivalence between the cut- Comput. 9 (1980), 628–635.
distance of graphs and a distance measure derived from a suitably [9] M. Bannach and T. Tantau. 2016. Parallel Multivariate Meta-Theorems. In
scaled homomorphism vector HomG . Proceedings of the 11th International Symposium on Parameterized and Exact
Computation (LIPIcs), J. Guo and D. Hermelin (Eds.), Vol. 63. Schloss Dagstuhl -
It is tempting to ask if a similar correspondence can be estab- Leibniz-Zentrum für Informatik, 4:1–4:17.
g ∥ · ∥ and HomF . There are many related question
lished between dist [10] P. Barceló, E.V. Kostylev, M. Monet, J. Pérez, J. Reutter, and J.P. Silva. 2020. The
Logical Expressiveness of Graph Neural Networks. In Proceedings of the 8th
that deserve further attention. International Conference on Learning Representations. https://openreview.net/
forum?id=r1lZ7AEKvB
[11] M. Belkin and P. Niyogi. 2003. Laplacian Eigenmaps for Dimensionality Reduc-
6 CONCLUDING REMARKS tion and Data Representation. Neural Computation 15, 6 (2003), 1373–1396.
In this paper, we gave an overview of embeddings techniques for [12] C. Berkholz, P. Bonsma, and M. Grohe. 2017. Tight Lower and Upper Bounds for
the Complexity of Canonical Colour Refinement. Theory of Computing Systems
graphs and relational structures. Then we discussed two related 60, 4 (2017), 581–614.
theoretical approaches, the Weisfeiler-Leman algorithm with its [13] C. Berkholz and M. Grohe. 2015. Limitations of Algebraic Approaches to Graph
Isomorphism Testing. In Proceedings of the 42nd International Colloquium on Au-
various ramifications and homomorphism vectors. We saw that they tomata, Languages and Programming, Part I (Lecture Notes in Computer Science),
have a rich and beautiful theory that leads to new, generic families M.M. Halldórsson, K. Iwama, N. Kobayashi, and B. Speckmann (Eds.), Vol. 9134.
of vector embeddings and helps us to get a better understanding of Springer Verlag, 155–166.
[14] J. Böker. 2019. Color Refinement, Homomorphisms, and Hypergraphs. In Pro-
some of the techniques used in practice, for example graph neural ceedings of the 45th International Workshop on Graph-Theoretic Concepts in
networks. Computer Science (Lecture Notes in Computer Science), I. Sau and D.M. Thilikos
Yet we have also seen that we are only at the beginning and many (Eds.), Vol. 11789. Springer, 338–350.
[15] J. Böker, Y. Chen, M. Grohe, and G. Rattan. 2019. The Complexity of Homomor-
questions remain open, in particular when it comes to similarity phism Indistinguishability. In Proceedings of the 44th International Symposium on
measures defined on graphs and relational structures. Mathematical Foundations of Computer Science (Leibniz International Proceedings
in Informatics (LIPIcs)), P. Rossmanith, P. Heggernes, and J.-P. Katoen (Eds.),
From a database perspective, it will be important to generalise Vol. 138. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 54:1–54:13.
the embedding techniques to relations of higher arities, which is [16] R. Bordawekar and O. Shmueli. 2017. Using word embedding to enable semantic
not as trivial as it may seem (and where surprisingly little has been queries in relational databases. In Proceedings of the Data Management for End
to End Learning Work Workshop, SIGMOD’17.
done so far). A central question is then how to query the embedded [17] R. Bordawekar and O. Shmueli. 2019. Exploiting Latent Information in Relational
data. Which queries can we answer at all when we only see the Databases via Word Embedding and Application to Degrees of Disclosure. In
vectors in latent space? How do imprecisions and variations due to Proceedings of the 9th Biennial Conference on Innovative Data Systems Research.
[18] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. 2013.
randomness affect the outcome of such query answers? Probably, Translating embeddings for modeling multi-relational data. In Advances in
we can only answer queries approximately, but what exactly is the neural information processing systems. 2787–2795.
[19] C. Borgs, J. Chayes, L. Lovász, V. Sós, B. Szegedy, and K. Vesztergombi. 2006.
semantics of such approximations? These are just a few questions Graph limits and parameter testing. In Proceedings of the 38th Annual ACM
that need to be answered, and I believe they offer very exciting Symposium on Theory of Computing. 261–270.
research opportunities for both theoreticians and practitioners. [20] K.M. Borgwardt and H.-P. Kriegel. 2005. Shortest-path kernels on graphs. In
Proceedings of the 5th IEEE International Conference on Data Mining. 74–81.
[21] J. Bourgain. 1985. On Lipschitz embeddings of finite metric spaces in Hilbert
spaces. Israel Journal of Mathematics 52, 1-2 (1985), 46–52.
Acknowledgements
This paper was written in strange times during the COVID-19 lockdown. I appreciate that some of my colleagues nevertheless took the time to answer various questions I had on the topics covered here and to give valuable feedback on an earlier version of this paper. In particular, I would like to thank Pablo Barceló, Neta Friedman, Benny Kimelfeld, Christopher Morris, Petra Mutzel, Martin Ritzert, and Yufei Tao.
REFERENCES
[1] A. Ahmed, N. Shervashidze, S. Narayanamurthy, V. Josifovski, and A.J. Smola. 2013. Distributed large-scale natural graph factorization. In Proceedings of the 22nd International World Wide Web Conference. 37–48.
[2] N. Alon and A. Naor. 2006. Approximating the Cut-Norm via Grothendieck's Inequality. SIAM J. Comput. 35 (2006), 787–803.
[3] V. Arvind, J. Köbler, S. Kuhnert, and Y. Vasudev. 2012. Approximate Graph Isomorphism. In Proceedings of the 37th International Symposium on Mathematical Foundations of Computer Science (Lecture Notes in Computer Science), B. Rovan, V. Sassone, and P. Widmayer (Eds.), Vol. 7464. Springer Verlag, 100–111.
[4] A. Atserias, L. Mančinska, D.E. Roberson, R. Šámal, S. Severini, and A. Varvitsiotis. 2019. Quantum and non-signalling graph isomorphisms. Journal of Combinatorial Theory, Series B 136 (2019), 289–328.
[15] J. Böker, Y. Chen, M. Grohe, and G. Rattan. 2019. The Complexity of Homomorphism Indistinguishability. In Proceedings of the 44th International Symposium on Mathematical Foundations of Computer Science (Leibniz International Proceedings in Informatics (LIPIcs)), P. Rossmanith, P. Heggernes, and J.-P. Katoen (Eds.), Vol. 138. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 54:1–54:13.
[16] R. Bordawekar and O. Shmueli. 2017. Using word embedding to enable semantic queries in relational databases. In Proceedings of the Data Management for End to End Learning Workshop, SIGMOD'17.
[17] R. Bordawekar and O. Shmueli. 2019. Exploiting Latent Information in Relational Databases via Word Embedding and Application to Degrees of Disclosure. In Proceedings of the 9th Biennial Conference on Innovative Data Systems Research.
[18] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems. 2787–2795.
[19] C. Borgs, J. Chayes, L. Lovász, V. Sós, B. Szegedy, and K. Vesztergombi. 2006. Graph limits and parameter testing. In Proceedings of the 38th Annual ACM Symposium on Theory of Computing. 261–270.
[20] K.M. Borgwardt and H.-P. Kriegel. 2005. Shortest-path kernels on graphs. In Proceedings of the 5th IEEE International Conference on Data Mining. 74–81.
[21] J. Bourgain. 1985. On Lipschitz embeddings of finite metric spaces in Hilbert spaces. Israel Journal of Mathematics 52, 1–2 (1985), 46–52.
[22] A. Bulatov, M. Grohe, and G. Rattan. [n.d.]. In preparation.
[23] J. Bulian and A. Dawar. 2014. Graph isomorphism parameterized by elimination distance to bounded degree. In Proceedings of the 9th International Symposium on Parameterized and Exact Computation (Lecture Notes in Computer Science), M. Cygan and P. Heggernes (Eds.), Vol. 8894. Springer Verlag, 135–146.
[24] J. Cai, M. Fürer, and N. Immerman. 1992. An optimal lower bound on the number of variables for graph identification. Combinatorica 12 (1992), 389–410.
[25] S. Cao, W. Lu, and Q. Xu. 2015. GraRep: Learning graph representations with global structural information. In Proceedings of the 24th ACM International Conference on Information and Knowledge Management. 891–900.
[26] S. Cao, W. Lu, and Q. Xu. 2016. Deep neural networks for learning graph representations. In Proceedings of the 30th AAAI Conference on Artificial Intelligence. 1145–1152.
[27] A. Cardon and M. Crochemore. 1982. Partitioning a graph in O(|A| log2 |V|). Theoretical Computer Science 19, 1 (1982), 85–98.
[28] Y. Chen and J. Flum. 2018. Tree-depth, Quantifier Elimination, and Quantifier Rank. In Proceedings of the 33rd Annual ACM/IEEE Symposium on Logic in Computer Science. 225–234.
[29] C. Cortes and V. Vapnik. 1995. Support-vector networks. Machine Learning 20, 3 (1995), 273–297.
[30] R. Curticapean, H. Dell, and D. Marx. 2017. Homomorphisms are a good basis for counting small subgraphs. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC '17). 210–223.
[31] V. Dalmau and P. Jonsson. 2004. The complexity of counting homomorphisms seen from the other side. Theoretical Computer Science 329, 1–3 (2004), 315–323.
[32] H. Dell, M. Grohe, and G. Rattan. 2018. Lovász Meets Weisfeiler and Leman. In Proceedings of the 45th International Colloquium on Automata, Languages and Programming (Track A) (LIPIcs), I. Chatzigiannakis, C. Kaklamanis, D. Marx, and D. Sannella (Eds.), Vol. 107. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 40:1–40:14.
[33] Z. Dvořák. 2010. On recognizing graphs by numbers of homomorphisms. Journal of Graph Theory 64, 4 (2010), 330–342.
[34] M. Elberfeld, M. Grohe, and T. Tantau. 2016. Where First-Order and Monadic Second-Order Logic Coincide. ACM Transactions on Computational Logic 17, 4 (2016). Article No. 25.
[35] M. Elberfeld, A. Jakoby, and T. Tantau. 2012. Algorithmic Meta Theorems for Circuit Classes of Constant and Logarithmic Depth. In Proceedings of the 29th International Symposium on Theoretical Aspects of Computer Science (LIPIcs), C. Dürr and T. Wilke (Eds.), Vol. 14. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 66–77.
[36] C. Gallicchio and A. Micheli. 2010. Graph echo state networks. In Proceedings of the IEEE International Joint Conference on Neural Networks.
[37] T. Gärtner, P. Flach, and S. Wrobel. 2003. On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines. Springer Verlag, 129–143.
[38] T. Gervens. 2018. Spectral Graph Similarity. Master's thesis, RWTH Aachen University.
[39] E. Grädel, M. Grohe, B. Pago, and W. Pakusa. 2019. A Finite-Model-Theoretic View on Propositional Proof Complexity. Logical Methods in Computer Science 15, 1 (2019), 4:1–4:53.
[40] E. Grädel, P.G. Kolaitis, L. Libkin, M. Marx, J. Spencer, M.Y. Vardi, Y. Venema, and S. Weinstein. 2007. Finite Model Theory and Its Applications. Springer Verlag.
[41] M. Grohe. 2017. Descriptive Complexity, Canonisation, and Definable Graph Structure Theory. Lecture Notes in Logic, Vol. 47. Cambridge University Press.
[42] M. Grohe. 2020. Counting Bounded Tree Depth Homomorphisms. ArXiv arXiv:2003.08164 [cs.LO] (2020).
[43] M. Grohe. 2020. Counting Bounded Tree Depth Homomorphisms. Submitted.
[44] M. Grohe, K. Kersting, M. Mladenov, and E. Selman. 2014. Dimension Reduction via Colour Refinement. In Proceedings of the 22nd Annual European Symposium on Algorithms (Lecture Notes in Computer Science), A. Schulz and D. Wagner (Eds.), Vol. 8737. Springer Verlag, 505–516.
[45] M. Grohe and M. Otto. 2015. Pebble Games and Linear Equations. Journal of Symbolic Logic 80, 3 (2015), 797–844.
[46] M. Grohe, G. Rattan, and G. Woeginger. 2018. Graph Similarity and Approximate Isomorphism. In Proceedings of the 43rd International Symposium on Mathematical Foundations of Computer Science (LIPIcs), I. Potapov, P.G. Spirakis, and J. Worrell (Eds.), Vol. 117. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 20:1–20:16.
[47] M. Grohe, P. Schweitzer, and D. Wiebking. 2020. Deep Weisfeiler Leman. ArXiv arXiv:2003.10935 [cs.LO] (2020).
[48] A. Grover and J. Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, B. Krishnapuram, M. Shah, A.J. Smola, C.C. Aggarwal, D. Shen, and R. Rastogi (Eds.). 855–864.
[49] W. Hamilton, R. Ying, and J. Leskovec. 2017. Inductive Representation Learning on Large Graphs. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems. 1024–1034.
[50] W.L. Hamilton, R. Ying, and J. Leskovec. 2017. Representation learning on graphs: methods and applications. ArXiv arXiv:1709.05584 [cs.SI] (2017).
[51] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[52] T. Horváth, T. Gärtner, and S. Wrobel. 2004. Cyclic pattern kernels for predictive graph mining. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 158–167.
[53] N. Immerman and E. Lander. 1990. Describing graphs: A first-order approach to graph canonization. In Complexity Theory Retrospective, A. Selman (Ed.). Springer Verlag, 59–81.
[54] P. Indyk. 2001. Algorithmic applications of low-distortion geometric embeddings. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science. 10–33.
[55] W. Johnson and J. Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26 (1984), 189–206.
[56] H. Kashima, K. Tsuda, and A. Inokuchi. 2003. Marginalized kernels between labeled graphs. In Proceedings of the 20th International Conference on Machine Learning. 321–328.
[57] K. Kersting, M. Mladenov, R. Garnett, and M. Grohe. 2014. Power Iterated Color Refinement. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, C.E. Brodley and P. Stone (Eds.). 1904–1910.
[58] T.N. Kipf and M. Welling. 2016. Variational Graph Auto-Encoders. ArXiv arXiv:1611.07308 [stat.ML] (2016).
[59] T.N. Kipf and M. Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations.
[60] R. Kondor and J.D. Lafferty. 2002. Diffusion Kernels on Graphs and Other Discrete Input Spaces. In Proceedings of the 19th International Conference on Machine Learning. 315–322.
[61] N.M. Kriege, F.D. Johansson, and C. Morris. 2019. A survey on graph kernels. ArXiv arXiv:1903.11835 [cs.LG] (2019).
[62] N. Kriege, M. Neumann, C. Morris, K. Kersting, and P. Mutzel. 2019. A unifying view of explicit and implicit feature maps of graph kernels. Data Mining and Knowledge Discovery 33, 6 (2019), 1505–1547. https://doi.org/10.1007/s10618-019-00652-0
[63] J.B. Kruskal. 1964. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 29, 1 (1964), 1–27.
[64] N. Linial, E. London, and Y. Rabinovich. 1995. The geometry of graphs and some of its algorithmic applications. Combinatorica 15 (1995), 212–245.
[65] L. Lovász. 1967. Operations with Structures. Acta Mathematica Hungarica 18 (1967), 321–328.
[66] L. Lovász. 1971. On the cancellation law among finite relational structures. Periodica Mathematica Hungarica 1, 2 (1971), 145–156.
[67] L. Lovász. 2012. Large Networks and Graph Limits. American Mathematical Society.
[68] L. Lovász and B. Szegedy. 2006. Limits of dense graph sequences. Journal of Combinatorial Theory, Series B 96, 6 (2006), 933–957.
[69] T. Maehara and H. NT. 2019. A Simple Proof of the Universality of Invariant/Equivariant Graph Neural Networks. ArXiv arXiv:1910.03802 [cs.LG] (2019).
[70] K. Makarychev, R. Manokaran, and M. Sviridenko. 2014. Maximum quadratic assignment problem: Reduction from maximum label cover and LP-based approximation algorithm. ACM Transactions on Algorithms 10, 4 (2014), 18.
[71] P. Malkin. 2014. Sherali–Adams relaxations of graph isomorphism polytopes. Discrete Optimization 12 (2014), 73–97.
[72] L. Mančinska and D.E. Roberson. 2019. Quantum isomorphism is equivalent to equality of homomorphism counts from planar graphs. ArXiv arXiv:1910.06958v2 [quant-ph] (2019).
[73] B.D. McKay and A. Piperno. 2014. Practical graph isomorphism, II. Journal of Symbolic Computation 60 (2014), 94–112.
[74] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, and J. Dean. 2013. Distributed representations of words and phrases and their compositionality. In Proceedings of the 27th Annual Conference on Neural Information Processing Systems. 3111–3119.
[75] H.L. Morgan. 1965. The generation of a unique machine description for chemical structures—a technique developed at Chemical Abstracts Service. Journal of Chemical Documentation 5, 2 (1965), 107–113.
[76] C. Morris, K. Kersting, and P. Mutzel. 2017. Globalized Weisfeiler-Lehman Graph Kernels: Global-Local Feature Maps of Graphs. In Proceedings of the 2017 IEEE International Conference on Data Mining. 327–336.
[77] C. Morris, N.M. Kriege, K. Kersting, and P. Mutzel. 2016. Faster kernels for graphs with continuous attributes via hashing. In Proceedings of the 16th IEEE International Conference on Data Mining. 1095–1100.
[78] C. Morris, M. Ritzert, M. Fey, W. Hamilton, J.E. Lenssen, G. Rattan, and M. Grohe. 2019. Weisfeiler and Leman Go Neural: Higher-order Graph Neural Networks. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence. AAAI Press, 4602–4609.
[79] V. Nagarajan and M. Sviridenko. 2009. On the maximum quadratic assignment problem. In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms. 516–524.
[80] A. Narayanan, M. Chandramohan, R. Venkatesan, L. Chen, Y. Liu, and S. Jaiswal. 2017. graph2vec: Learning Distributed Representations of Graphs. ArXiv (CoRR) arXiv:1707.05005 [cs.AI] (2017).
[81] J. Nešetřil and P. Ossona de Mendez. 2006. Linear time low tree-width partitions and algorithmic consequences. In Proceedings of the 38th ACM Symposium on Theory of Computing. 391–400.
[82] M. Neumann, R. Garnett, and K. Kersting. 2013. Coinciding walk kernels: Parallel absorbing random walks for learning with graphs and few labels. In Proceedings of the 5th Asian Conference on Machine Learning. 357–372.
[83] M. Nickel, V. Tresp, and H.-P. Kriegel. 2011. A three-way model for collective learning on multi-relational data. In Proceedings of the 28th International Conference on Machine Learning. 809–816.
[84] R. O'Donnell, J. Wright, C. Wu, and Y. Zhou. 2014. Hardness of Robust Graph Isomorphism, Lasserre Gaps, and Asymmetry of Random Graphs. In Proceedings of the 25th Annual ACM-SIAM Symposium on Discrete Algorithms. 1659–1677.
[85] M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu. 2016. Asymmetric transitivity preserving graph embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1105–1114.
[86] S. Pan, R. Hu, G. Long, J. Jiang, L. Yao, and C. Zhang. 2018. Adversarially Regularized Graph Autoencoder for Graph Embedding. ArXiv (CoRR) arXiv:1802.04407 [cs.LG] (2018). http://arxiv.org/abs/1802.04407
[87] B. Perozzi, R. Al-Rfou, and S. Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 701–710.
[88] T. Pham, T. Tran, D. Phung, and S. Venkatesh. 2017. Column networks for collective classification. In Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2485–2491.
[89] J. Ramon and T. Gärtner. 2003. Expressivity versus efficiency of graph kernels. In Proceedings of the 1st International Workshop on Mining Graphs, Trees and Sequences. 65–74.
[90] F. Scarselli, M. Gori, A.C. Tsoi, M. Hagenbuchner, and G. Monfardini. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20, 1 (2009), 61–80.
[91] M. Schlichtkrull, T.N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling. 2018. Modeling relational data with graph convolutional networks. In Proceedings of the European Semantic Web Conference (Lecture Notes in Computer Science), A. Gangemi, R. Navigli, M.-E. Vidal, P. Hitzler, R. Troncy, L. Hollink, A. Tordai, and M. Alam (Eds.), Vol. 10843. Springer Verlag, 593–607.
[92] B. Schölkopf, A. Smola, and K.-R. Müller. 1997. Kernel principal component analysis. In Proceedings of the International Conference on Artificial Neural Networks (Lecture Notes in Computer Science), W. Gerstner, A. Germond, M. Hasler, and J.D. Nicoud (Eds.), Vol. 1327. Springer Verlag, 583–588.
[93] S. Shalev-Shwartz and S. Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
[94] N. Shervashidze, P. Schweitzer, E.J. van Leeuwen, K. Mehlhorn, and K.M. Borgwardt. 2011. Weisfeiler-Lehman Graph Kernels. Journal of Machine Learning Research 12 (2011), 2539–2561.
[95] N. Shervashidze, S. Vishwanathan, T. Petri, K. Mehlhorn, and K. Borgwardt. 2009. Efficient graphlet kernels for large graph comparison. In Artificial Intelligence and Statistics. 488–495.
[96] A.J. Smola and R. Kondor. 2003. Kernels and Regularization on Graphs. In Proceedings of the 16th Annual Conference on Computational Learning Theory (Lecture Notes in Computer Science), B. Schölkopf and M.K. Warmuth (Eds.), Vol. 2777. Springer Verlag, 144–158.
[97] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. 2015. LINE: Large-scale information network embedding. In Proceedings of the 24th International World Wide Web Conference. 1067–1077.
[98] J. Tenenbaum, V. De Silva, and J. Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290 (2000), 2319–2323.
[99] G. Tinhofer. 1991. A note on compact graphs. Discrete Applied Mathematics 30 (1991), 253–264.
[100] J. Tönshoff, M. Ritzert, H. Wolf, and M. Grohe. 2019. Graph Neural Networks for Maximum Constraint Satisfaction. ArXiv (CoRR) arXiv:1909.08387 [cs.AI] (2019).
[101] D. Wang, P. Cui, and W. Zhu. 2016. Structural deep network embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1225–1234.
[102] Q. Wang, Z. Mao, B. Wang, and L. Guo. 2017. Knowledge Graph Embedding: A Survey of Approaches and Applications. IEEE Transactions on Knowledge and Data Engineering 29, 12 (2017), 2724–2743.
[103] B. Weisfeiler and A. Leman. 1968. The reduction of a graph to canonical form and the algebra which appears therein. NTI, Series 2 (1968). English translation by G. Ryabov available at https://www.iti.zcu.cz/wl2018/pdf/wl_paper_translation.pdf.
[104] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P.S. Yu. 2019. A comprehensive survey on Graph Neural Networks. ArXiv arXiv:1901.00596 [cs.LG] (2019).
[105] Z. Xinyi and L. Chen. 2019. Capsule Graph Neural Network. In Proceedings of the 7th International Conference on Learning Representations. OpenReview.net. https://openreview.net/forum?id=Byl8BnRcYm
[106] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. 2019. How powerful are graph neural networks? In Proceedings of the 7th International Conference on Learning Representations.