Algorithms 17 00116
Article
Progressive Multiple Alignment of Graphs
Marcos E. González Laffitte 1,2, * and Peter F. Stadler 1,2,3,4,5,6,7, *
Abstract: The comparison of multiple (labeled) graphs with unrelated vertex sets is an important
task in diverse areas of applications. Conceptually, it is often closely related to multiple sequence
alignments since one aims to determine a correspondence, or more precisely, a multipartite matching
between the vertex sets. There, the goal is to match vertices that are similar in terms of labels and
local neighborhoods. Alignments of sequences and ordered forests, however, have a second aspect
that does not seem to be considered for graph comparison, namely the idea that an alignment is
a superobject from which the constituent input objects can be recovered faithfully as well-defined
projections. Progressive alignment algorithms are based on the idea of computing multiple align-
ments as a pairwise alignment of the alignments of two disjoint subsets of the input objects. Our
formal framework guarantees that alignments have compositional properties that make alignments
of alignments well-defined. The various similarity-based graph matching constructions do not share
this property and solve substantially different optimization problems. We demonstrate that optimal
multiple graph alignments can be approximated well by means of progressive alignment schemes.
The solution of the pairwise alignment problem is reduced formally to computing maximal common
induced subgraphs. Similar to the ambiguities arising from consecutive indels, pairwise alignments
of graph alignments require the consideration of ambiguous edges that may appear between alignment
columns with complementary gap patterns. We report a simple reference implementation in
Python/NetworkX intended to serve as a starting point for further developments. The computational
feasibility of our approach is demonstrated on test sets of small graphs that mimic, in particular,
applications to molecular graphs.
Citation: González Laffitte, M.E.; Stadler, P.F. Progressive Multiple Alignment of Graphs. Algorithms 2024, 17, 116. https://doi.org/10.3390/a17030116
as a means of comparing RNA secondary structures [4]. More recently, a formal framework
was developed to define alignments in a much more general setting [5], which we will
review in the Theory section below (Section 2) in some detail. In this general setting, align-
ments are considered as super-objects containing the contributing input objects (sequences,
trees, graphs, etc.) such that these are recovered by means of well-defined projections.
Specializing this framework to graphs yields a notion of multiple graph alignment (MGA)
that pertains to both directed and undirected graphs and also accommodates labeled
graphs. The key property of an MGA is that the constituent input graphs appear as induced
subgraphs of the alignment graph.
The term “graph alignment” is frequently used in the literature to refer to match-
ings between the vertex sets of graphs that maximize some similarity measure [6–8]. The
construction considered in [9] uses “alignment” columns allowing “dummy nodes” cor-
responding to gap symbols, but does not endow the “alignment” with a graph structure.
Similarly, IsoRank in its 1-1 mapping mode [10] computes a maximum weighted multipar-
tite matching between the vertices of input networks Gi constrained in such a way that
the connected components of its transitive closure (i.e., the alignment columns) contain
at most one vertex from each of the Gi . The weights are application-specific similarity
scores computed for all pairs of vertices from different Gi and typically combine similarities
of vertex attributes with neighborhood similarities, as in FINAL [11] or HashAlign [12].
Different algorithmic approaches have been employed to solve this optimization problem.
While [10] uses a greedy heuristic, integer quadratic programming is proposed in [13].
Moreover, a wide variety of learning approaches have been used in recent years for this
type of graph comparison, see, e.g., [14] and the references therein.
In summary, all these graph matching methodologies differ in two important aspects
from the concept of alignments used here: (1) they do not require the strict preservation
of the local structure inherent in alignments and (2) they do not consider the alignment
again as a graph. Furthermore, in [15], a “graph alignment” was defined by means of
injective embeddings of the input graphs Gi into an “entity graph” H such that vertices
adjacent in Gi are mapped to adjacent vertices in H. In contrast to our framework, however,
non-adjacent vertices in Gi may be mapped to adjacent vertices in H. The constituent
graphs Gi thus appear as subgraphs of H but not, in general, as induced subgraphs of H, and thus are not
recoverable from H as projections.
The compositional properties of alignments [5] require both the projective property and
the fact that the alignment is again a graph. As a consequence, “alignments of graph
alignments” are well defined, preserve the projections of the constituent input graphs, and
thus are again graph alignments of the same constituent graphs. This is a prerequisite
for introducing the notion of a progressive graph alignment guided by a similarity tree
(Section 3.1), in analogy to the progressive methodologies employed for multiple sequence
alignment [16]; for an overview of our methodology, see Figure 1. Graph matchings, in
contrast, have much weaker compositional properties. As noted in [9], pairwise graph
matchings ( Gi , G0 ) with a common “reference graph” G0 can be combined to a matching of
multiple graphs. On the other hand, in our setting, alignments of graph alignments can
be combined arbitrarily. The compositional properties of the construction in [15], which
lacks the projective property, have not been studied to our knowledge. Graph matching
approaches, including [15], thus address combinatorial optimization problems that are
clearly distinct from the formal graph alignments studied here.
The multiple alignment problem is NP-complete already for sequences [17–19]; hence,
in practice, one has to resort to heuristics. A simple but efficient approach is progressive
alignment [16], which reduces the problem to computing pairwise alignments of alignments
of subsets of input objects. Progressive multiple alignments of rooted ordered forests have
been considered in [4] as a means of comparing multiple RNA structures. Star alignments,
comprising the pairwise alignments of all input objects to common reference objects, are re-
lated to progressive alignments. This strategy has been explored in [9] for graph matchings.
For large networks, an evolutionary algorithm [20] and an ant-colony algorithm [21] have
been explored.
Here, we describe a progressive alignment procedure that is based on exact pairwise
graph alignments. As we shall see, these can be computed from maximum common
induced subgraphs (MCISs). To this end, we first introduce the formal theory of multiple
alignments of graphs that properly generalizes sequence alignments. We then demonstrate
that MGAs can be computed with a progressive framework in practice and show that this
approach yields accurate results. Since we consider here an optimization problem that is
significantly different from graph matching approaches considered in the literature, we
refrain from comparing MGAs with graph matching methods.
Figure 1. Overview of ProGrAlign for computing a multiple graph alignment of a collection of input
graphs (a). A fast heuristic, here using a graph kernel, is used to determine pairwise similarities
between the input graphs. The similarity matrix (b) is used to construct a guide tree (c) using a
clustering method such as WPGMA [22]. This guide tree dictates the order in which one is to compute
pairwise alignments. Alignments of graphs, e.g., of G1 and G3, are again graphs. Thus, alignments of
alignments with other input graphs, here G2, or graph alignments, are well defined and reduced to
computing pairwise graph alignments. The final result is obtained as the alignment corresponding to
the root of the guide tree. (d) Every input graph is contained in the multiple alignment graph as an
induced subgraph, emphasized here with darker vertices and edges.
2. Theory
2.1. Abstract Graph Alignments
Following [5], we consider a class of finite hereditary set systems (X, S), where X is
a finite set and S is some subset of $2^X \cup 2^{2^X} \cup \dots$ such that for every subset Y ⊆ X the
restriction S_Y of S to Y is well defined and satisfies the following consistency property:
S_Y = (S_Z)_Y for all Y ⊆ Z ⊆ X. In the following, we will refer to (X, S) as an object.
For Y ⊆ X, (Y, SY ) =: ( X, S )[Y ] will be called the subobject of X induced by Y. Two set
systems (X, S) and (X′, S′) are isomorphic if there is a bijection φ : X → X′ such that
S′ = φ(S), where we use the convention that the application of a function to a set is
defined point-wise, i.e., φ(Z) := {φ(x) | x ∈ Z}. In this case, φ is an isomorphism and we
write (X, S) ≃ (X′, S′).
A very simple example of an object is the set X of positions in a DNA or protein
sequence endowed with the total order ≤ since these linear biomolecules have a specified
orientation. The subobjects of (X, ≤) are obtained as subsets of X that preserve the order.
To see that this indeed conforms to the abstract set systems introduced in the previous
paragraph, we note that order relations can be encoded by sets: Setting Sx := {y ∈ X |y ≤ x }
we see that x ≤ y if and only if Sx ⊆ Sy . Thus the order ≤ is encoded by the subset relation
in the set system S := {Sx | x ∈ X }. The subobjects (Y, SY ) induced by Y thus are the set
systems SY = {Sx ∩ Y | x ∈ X }. Similarly, we may consider the class of rooted ordered
trees [3]. As noted, for example, in [5], these correspond to finite sets with an orthogonal
pair of partial orders: two vertices are either equal, comparable w.r.t. the ancestor order,
or comparable w.r.t. the sibling order. In the present contribution, the objects of interest are graphs.
Here, X denotes the vertices and S corresponds to the edges of a graph, i.e., S is a set of
unordered pairs of distinct vertices. The sub-objects are the induced subgraphs defined by
a subset of vertices.
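The encoding of a total order by a set system can be checked on a toy example. The following is a minimal sketch, assuming integer positions and Python's built-in comparison as the order ≤; the helper names `down_sets` and `restrict` are hypothetical and not part of any reference implementation.

```python
def down_sets(elements):
    """Return S_x = {y in X | y <= x} for every position x."""
    return {x: frozenset(y for y in elements if y <= x) for x in elements}


def restrict(system, subset):
    """Restriction of a set system to Y: {S ∩ Y | S in the system}."""
    keep = frozenset(subset)
    return {s & keep for s in system}


X = [1, 2, 3, 4, 5]
S = down_sets(X)

# The order is encoded by inclusion: x <= y if and only if S_x is a subset of S_y.
order_ok = all((x <= y) == (S[x] <= S[y]) for x in X for y in X)

# Consistency of restrictions: S_Y equals (S_Z)_Y whenever Y ⊆ Z ⊆ X.
Y, Z = {2, 4}, {1, 2, 4, 5}
system = set(S.values())
S_Y = restrict(system, Y)
consistent = S_Y == restrict(restrict(system, Z), Y)
```

In this example the restricted system on Y = {2, 4} is again a chain of nested sets, so the subobject is again a total order on the retained positions.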
Abstractly, an alignment of a set of “input” objects ( Xi , Si ) consists of an object ( X, S )
of the same class that contains each of the ( Xi , Si ). More precisely, Xi is contained in X in
the sense that there is an injective function φi : Xi → X such that the subobject ( X, S )[Yi ]
where Yi = φi ( Xi ) is isomorphic to ( Xi , Si ) with φi : Xi → Yi being an isomorphism.
It will be convenient to think of the ( Xi , Si ) as the rows of the alignment in analogy
to multiple sequence alignments. Correspondingly, each x ∈ X specifies a column. Thus,
the alignment can be encoded by a pair ( X, f ) where f : X → {0, 1}n and each component
f i : X → {0, 1} determines a subset Yi := { x ∈ X | f i ( x ) = 1} such that
( X, S )[Yi ] ≃ ( Xi , Si ) . (1)
To avoid trivial cases, we forbid “all-gap columns”, i.e., we insist that f i ( x ) = 1 for
at least one i. Since we assume that the induced subobjects are unique, f defines the
embeddings φi : Xi → X up to isomorphism.
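The encoding of an alignment by a pair (X, f) can be illustrated with two small path graphs. This is only a sketch under stated assumptions: graphs are stored as sets of two-element frozensets, the isomorphism test is a brute-force search that is only usable for tiny graphs, and all names (`induced_edges`, `is_isomorphic`, `row`) are hypothetical.

```python
from itertools import permutations


def edge(u, v):
    return frozenset((u, v))


def induced_edges(edges, vertices):
    """Edges of the subgraph induced by the given vertex subset."""
    keep = set(vertices)
    return {e for e in edges if e <= keep}


def is_isomorphic(vertices_a, edges_a, vertices_b, edges_b):
    """Brute-force isomorphism test, intended for tiny graphs only."""
    va, vb = list(vertices_a), list(vertices_b)
    if len(va) != len(vb) or len(edges_a) != len(edges_b):
        return False
    for image in permutations(vb):
        phi = dict(zip(va, image))
        if {frozenset(phi[u] for u in e) for e in edges_a} == set(edges_b):
            return True
    return False


# Alignment graph on columns a, b, c, d and its presence vectors f(x).
X_edges = {edge("a", "b"), edge("b", "c"), edge("c", "d")}
f = {"a": (1, 0), "b": (1, 1), "c": (1, 1), "d": (0, 1)}

# Two input graphs (rows): both are paths on three vertices.
inputs = [
    ([1, 2, 3], {edge(1, 2), edge(2, 3)}),
    (["u", "v", "w"], {edge("u", "v"), edge("v", "w")}),
]


def row(i):
    """Y_i = {x in X | f_i(x) = 1}."""
    return [x for x, bits in f.items() if bits[i] == 1]


no_all_gap_column = all(any(bits) for bits in f.values())
rows_ok = [
    is_isomorphic(row(i), induced_edges(X_edges, row(i)), vs, es)
    for i, (vs, es) in enumerate(inputs)
]
match_columns = [x for x, bits in f.items() if all(bits)]
```

Each row is recovered as the induced subgraph on the columns in which it is present, which is the content of Equation (1).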
We say that a point x in an alignment corresponds to a match column if f_i(x) = 1 for
all i. Writing X_M = {x ∈ X | f_i(x) = 1, 1 ≤ i ≤ n} for the set of matches, we observe that
the restriction (X_M, S_{X_M}) of (X, S) to X_M ⊆ X is a common sub-object of all (X_i, S_i).
The converse is not always true, however: it is not always possible to extend a common
sub-object to an alignment that contains this common sub-object as its set of matches.
Probably the most well-known example is the distinction between the editing and alignment
of two rooted ordered forests; see, e.g., [25] (Figure 1).
In the case of graphs, however, the situation is simple. Let G and H be two graphs
with a common induced subgraph K with embeddings φG : V (K ) → V ( G ) and φ H :
V (K ) → V ( H ). Identifying the vertices φG ( x ) and φ H ( x ) for all x ∈ V (K ) amounts to
gluing together G and H at the vertices of K. This results in a graph A that contains K as
an induced subgraph. All other vertices belong to either V ( G ) or V ( H ). By construction,
A contains only edges that are present in at least one of G and H, and both G and H are
contained in A as induced subgraphs. In particular, therefore, A is an alignment of G and
H. It is, in fact, the unique edge-minimal alignment of G and H given the common induced
subgraph K.
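The gluing construction described above can be sketched in a few lines. This is a minimal illustration, assuming simple undirected graphs without loops, stored as sets of two-element frozensets; the functions `glue` and `project` and the representation of alignment columns as pairs `(g, h)` with `None` for a gap are assumptions made for this example.

```python
def edge(u, v):
    return frozenset((u, v))


def glue(vertices_g, edges_g, vertices_h, edges_h, phi_g, phi_h):
    """Glue G and H along a common induced subgraph K.

    phi_g and phi_h map each vertex of K to its image in G and H.
    Columns are pairs (g, h); None marks a gap in that row.
    """
    col_g, col_h = {}, {}
    for k in phi_g:
        column = (phi_g[k], phi_h[k])
        col_g[phi_g[k]] = column
        col_h[phi_h[k]] = column
    for g in vertices_g:
        col_g.setdefault(g, (g, None))
    for h in vertices_h:
        col_h.setdefault(h, (None, h))
    columns = set(col_g.values()) | set(col_h.values())
    # Only edges present in at least one of G and H are inserted.
    edges = {edge(col_g[u], col_g[v]) for u, v in map(tuple, edges_g)}
    edges |= {edge(col_h[u], col_h[v]) for u, v in map(tuple, edges_h)}
    return columns, edges


def project(columns, edges, i):
    """Recover row i as the induced subgraph on its non-gap columns."""
    keep = {c for c in columns if c[i] is not None}
    return {frozenset(c[i] for c in e) for e in edges if e <= keep}


# G is the path a-b-c, H is the triangle x-y-z, K is a single edge.
edges_g = {edge("a", "b"), edge("b", "c")}
edges_h = {edge("x", "y"), edge("y", "z"), edge("x", "z")}
columns, edges = glue(
    ["a", "b", "c"], edges_g, ["x", "y", "z"], edges_h,
    {"k1": "a", "k2": "b"}, {"k1": "x", "k2": "y"},
)
```

No edge is inserted between the columns `("c", None)` and `(None, "z")`, since neither input graph contains both vertices; this is the edge-minimal choice.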
an arbitrary binary guide tree, it suffices to find a maximum common induced graph and
use it to determine how to glue the two child-graphs together. Note that at this point, we
make no statement on the optimality of either ( X, S ) or any of the intermediates. We
merely state that every graph alignment ( X, S ) can be obtained by reverting the process
of its decomposition along any binary tree T. Ambiguous edges only provide a moderate
complication in progressive alignments, which we will consider below in conjunction with
the scoring model. In fact, when extending a set of matches, one can choose whether an
ambiguous edge is considered to be present or absent.
Figure 2. Progressive graph alignment of three graphs (bottom) along a guide tree (fat red edges).
The matching of vertices is shown by vertical lines. Two of the ambiguous edges in the pairwise
alignment, which are inserted due to the matches with the third graph, are shown in green.
2.5. Labels
Usually, the input objects Yi are endowed with a labeling function ℓi : Yi → L, where
L is a finite set of labels. For biological sequence alignments, ℓi ( x ) denotes the nucleotide
or amino acid at sequence position x ∈ Yi . Similarly, if the graphs designate structural
formulae of organic molecules, then ℓi ( x ) is the chemical element of atom x in molecule
Yi . Often, alignments are specified directly in terms of these labels. Thus, we may set
ℓ̃ : X → ( L ∪ {-})n such that ℓ̃i ( x ) = ℓi ( x ) if f i ( x ) = 1 and ℓ̃i ( x ) = - if f i ( x ) = 0. The
gap symbols - therefore correspond exactly to the points (i.e., alignment columns) in X
that are deleted in Yi . Since we have f i ( x ) = 1 if ℓ̃i ( x ) ∈ L and f i ( x ) = 0 if ℓ̃i ( x ) = -, we
can equivalently specify the alignment by ( X, ℓ̃). The labeled input objects (Yi , ℓi ) are thus
obtained from (X, ℓ̃) by deleting from X all vertices with ℓ̃i(x) = - and retaining only the
i-th label ℓ̃i(x) for all other points.
The labels can be used to impose additional constraints on the common sub-objects.
For instance, one may want to restrict matches to vertices with the same labels, in which case
ℓ̃i(x) ≠ - and ℓ̃j(x) ≠ - implies ℓ̃i(x) = ℓ̃j(x). This is useful, e.g., to preserve atom labels.
In principle, arbitrary rules for the (in)compatibility of labels in a column can be defined.
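The label-based encoding of an alignment, and a same-label rule for columns, can be sketched as follows. The column names, the example labels, and the helper functions are hypothetical; the only convention taken from the text is that `-` denotes a gap.

```python
GAP = "-"

# Each alignment column carries one label per row, or a gap symbol.
alignment = {
    "c1": ("C", "C", GAP),
    "c2": ("O", GAP, "O"),
    "c3": (GAP, "N", "N"),
}


def presence(labels):
    """Recover f(x) from the labels: 1 for a label, 0 for a gap."""
    return tuple(0 if label == GAP else 1 for label in labels)


def project_labels(columns, i):
    """Labels of input object i: drop gap columns, keep the i-th label."""
    return {x: labels[i] for x, labels in columns.items() if labels[i] != GAP}


def same_label_rule(labels):
    """True if all non-gap labels in a column agree."""
    return len({label for label in labels if label != GAP}) <= 1
```

Other compatibility rules can be substituted for `same_label_rule` without changing the projection step.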
is again a set of matches and thus defines again a common sub-object, the set 𝓜 of all sets
of matches forms an independence system, i.e., M ∈ 𝓜 and M′ ⊆ M implies M′ ∈ 𝓜. We
shall consider here only scoring functions that are defined on sets of matches, i.e., the score
of a pairwise alignment is given by σ ( M ) where M is the set of matches corresponding to
match columns X M .
We say that a scoring function σ : 𝓜 → ℝ is strictly monotone if M′ ⊊ M implies
σ(M′) < σ(M). For a strictly monotone scoring function, the maximum score can be attained
only for sets of matches that are maximal, i.e., that cannot be augmented by an additional
match. In the case of graphs, furthermore, every set of matches defines an alignment (which
is unique up to ambiguous edges), and thus an optimal alignment is determined by an MCIS
that in addition maximizes the scoring function σ. The score may also depend on edge
labels within the common induced subgraphs as long as the strict monotonicity property
is preserved.
A convenient special case is an additive scoring scheme in which every vertex-match
and every edge-match yield a non-negative contribution. The inclusion of additive edge-
dependent scores is unproblematic from an algorithmic point of view because any proce-
dure that adds or removes a match xx ′ also adds or removes the incident edges { xx ′ , yy′ }
connecting xx ′ to the vertices yy′ in the rest of the common induced subgraph. Thus, every
edge in the common induced subgraph is associated with a unique vertex operation, ensur-
ing that adding/removing of single vertices affects the total score in a consistent manner.
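An additive scoring scheme of this kind can be sketched as follows. The weights `vertex_score` and `edge_score` are hypothetical parameters; choosing a strictly positive vertex weight is one simple way to make the score strictly monotone.

```python
def edge(u, v):
    return frozenset((u, v))


def additive_score(matches, edges_g, edges_h, vertex_score=1.0, edge_score=0.5):
    """Score a set of matches (x, x') between graphs G and H.

    Every match contributes vertex_score; every pair of matches that is
    joined by an edge in both graphs contributes edge_score.
    """
    ordered = sorted(matches)
    total = vertex_score * len(ordered)
    for i, (x1, x2) in enumerate(ordered):
        for y1, y2 in ordered[i + 1:]:
            if edge(x1, y1) in edges_g and edge(x2, y2) in edges_h:
                total += edge_score
    return total


# Two paths on three vertices.
edges_g = {edge("a", "b"), edge("b", "c")}
edges_h = {edge("x", "y"), edge("y", "z")}
small = {("a", "x"), ("b", "y")}
large = small | {("c", "z")}
score_small = additive_score(small, edges_g, edges_h)
score_large = additive_score(large, edges_g, edges_h)
```

Adding the match `("c", "z")` adds one vertex contribution and the contribution of the single edge incident to it, so the increment is attributable to one vertex operation.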
Assuming a strictly monotone scoring model, ambiguous edges are easy to handle.
Suppose we attempt to add a match x1 x2 to a set M of matches. Then, the edge { x1 x2 , y1 y2 }
to some match y1 y2 is added if either both x1 y1 and x2 y2 are ambiguous edges, or one of
the two edges is unambiguous and the other is ambiguous. No edge is inserted if there
is a non-edge between x1 and y1 or x2 and y2 . Note that if both the edges x1 y1 and x2 y2
are ambiguous, then { x1 x2 , y1 y2 } is again ambiguous. In either case, the extension of M
by x1 x2 by assumption incurs a positive score increment and thus yields a better score
than leaving x1 and x2 as two unmatched vertices. Finally, we note that since ambiguous
edges are removed in all branches at some lower-down node of the guide tree, they do not
convey information on the input graphs. Thus, an edge { x1 x2 , y1 y2 } where x1 y1 or x2 y2 is
ambiguous naturally does not yield a positive score even when otherwise edge-matches
are associated with score contributions.
The most natural scoring function for alignments of alignments is the sum-of-pairs
scoring model, in which the contributions of all pairwise alignments are simply added. In
our setting, this is particularly appealing since gap characters (-) simply do not contribute
to the scoring at all. Note that the sum-of-pairs scoring also preserves strict monotonicity,
since each extending match by construction yields non-zero contributions for at least one
row in each of the aligned alignments.
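For vertex labels, the sum-of-pairs contribution of matching one column from each of two alignments can be sketched as below. The pair score used here (1.0 for equal labels, 0.25 otherwise) is an arbitrary illustrative choice and not the scoring used by the authors.

```python
GAP = "-"


def sum_of_pairs(column_a, column_b, pair_score):
    """Sum the pair scores over all rows of column_a and column_b.

    Gap characters are skipped and thus contribute nothing.
    """
    total = 0.0
    for label_a in column_a:
        for label_b in column_b:
            if label_a != GAP and label_b != GAP:
                total += pair_score(label_a, label_b)
    return total


def example_pair_score(label_a, label_b):
    # Hypothetical scores: positive in every case, higher for equal labels.
    return 1.0 if label_a == label_b else 0.25


score = sum_of_pairs(("C", GAP, "C"), ("C", "N"), example_pair_score)
```

Because every non-gap pair contributes a positive amount in this sketch, matching two columns always scores higher than leaving them unmatched.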
3. Algorithmic Considerations
3.1. Construction of the Guide Tree Based on Graph Kernels
Every binary tree T with leaves that are in a one-to-one correspondence with the
input graphs may serve as guide tree for a progressive alignment. It is well known for
multiple alignments of sequences, however, that the guide tree influences the quality of the
alignment [28,29]. In the case of graph alignments, the computational effort for computing
MCIS, and thus for the pairwise alignment problems, also depends on the similarity of the
input graphs. It is desirable, therefore, to use a guide tree along which for each inner node,
the two children correspond to graphs that are as similar as possible. The most useful guide
trees thus correspond to a parsimonious hierarchical clustering of the input graphs [16]. In
the case of alignments of homologous genetic sequences, a good approximation of the correct
phylogeny is ideal. In practice, good guide trees are obtained by hierarchical clustering of
the input objects.
The progressive computation of the multiple alignment requires O(N) pairwise alignments.
In contrast, all O(N²) pairwise comparisons are required for hierarchical clustering
to estimate the guide tree. However, pairwise alignments can be used to compute the
distances or similarities of all pairs of input objects. It is therefore desirable to replace
the alignment-based similarities by a computationally less expensive approximation for
the computation of the guide tree. Comparisons of two graphs based on MCIS [30], i.e.,
d_MCIS(G, H) := |V(G)| + |V(H)| − 2|MCIS(G, H)|, as well as other forms of graph editing
distance [31] are NP-complete.
Among the heuristic alternatives are distance measures based on semi-definite graph
kernels [32], i.e., bilinear functions κ : 𝒢 × 𝒢 → ℝ on a non-empty set 𝒢 satisfying
$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j \, \kappa(G_i, G_j) \ge 0$ for all G1, …, Gn ∈ 𝒢 and any c1, …, cn ∈ ℝ. The kernel
function κ in essence provides a similarity measure between graphs.
Furthermore, as shown in [33,34], due to its relation with inner products in vector
spaces, the similarities κ ( Gi , Gj ) can be transformed into a distance measure d( Gi , Gj ) by
taking the square root of the value κ(Gi, Gi) + κ(Gj, Gj) − 2κ(Gi, Gj). Both the similarities
and their associated distances can be used as inputs for a simple agglomerative hierarchical
clustering procedure, here simply WPGMA [22], to infer a guide tree T.
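The step from a kernel matrix to a guide tree can be sketched in plain Python. This is an illustrative stand-in rather than the authors' implementation: the kernel values below are made up, the tree is returned as nested tuples, and the function names are hypothetical. WPGMA is realized by giving the two merged clusters equal weight when updating distances.

```python
import math


def kernel_to_distance(K):
    """d(Gi, Gj) = sqrt(K[i][i] + K[j][j] - 2 K[i][j])."""
    n = len(K)
    return [
        [math.sqrt(max(K[i][i] + K[j][j] - 2 * K[i][j], 0.0)) for j in range(n)]
        for i in range(n)
    ]


def wpgma(D, names):
    """Agglomerative clustering with WPGMA updates; returns nested tuples."""
    clusters = {i: names[i] for i in range(len(names))}
    dist = {(i, j): D[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)  # closest pair of clusters
        for c in clusters:
            if c in (a, b):
                continue
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            dist[(c, next_id)] = (d_ac + d_bc) / 2  # equal-weight average
        clusters[next_id] = (clusters.pop(a), clusters.pop(b))
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        next_id += 1
    return next(iter(clusters.values()))


# Made-up kernel values for three graphs; G1 and G2 are the most similar.
K = [[4.0, 3.0, 1.0], [3.0, 4.0, 1.0], [1.0, 1.0, 4.0]]
D = kernel_to_distance(K)
tree = wpgma(D, ["G1", "G2", "G3"])
```

The resulting tree first joins G1 and G2 and then adds G3, so the progressive alignment would align the two most similar graphs first.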
Originally conceived in cheminformatics, a wide array of graph kernels has been stud-
ied depending on different properties and features of graphs, including paths, walks, cycles,
spanning trees, matchings, local subgraphs, etc.; see [35] for a survey. Implementations of
kernel functions for graphs (including labeled, weighted, and directed graphs) are available
as widely used software packages such as graphkit-learn [36]. Here, we made use of
the Structural_Shortest_Path kernel implementation of the graphkit-learn library in
order to produce our kernel-based guide trees. As shown in Appendix A.1, the kernel-based
distance and the MCIS-based distance are well correlated.
where anchors lead to a drastic reduction in the search space since only the subtree rooted
at the anchor matches M∗ needs to be processed.
If the MCIS is expected to be small compared to both G1 and G2 , then a direct, VF2-like
expansion of the matches set M appears to be the most promising strategy. If the solution
is expected to cover most of G2 , a viable alternative is to “trim” G2 by removing some of its
vertices; see Algorithm 1. An optimized subgraph isomorphism test, denoted by VF2_sgi()
in Algorithm 1, can then be used to decide whether the trimmed induced subgraph G2′ of
G2 appears as a subgraph in G1 . This allows us to use many of the optimizations in VF2 that
are not applicable if all maximal common subgraphs need to be enumerated. The iteration
over the sets S of removed vertices can be restricted to terminate at the cardinality |S| at
which the first match M is encountered, as in our current implementation. While a recursive
version of such a trimming algorithm could also be implemented, it seems difficult to make
use of the latter bound in such a setting. Alternatively, one may exclude S if a non-empty set of
matches was already returned for one of its subsets. We note that for arbitrary monotone
scores, the latter strategy must be employed since there is no guarantee that the MCIS with
the maximum score is also maximum in cardinality.
Algorithm 1: Iterative_Trimming(G1, G2, M∗)
Data: Graphs G1 and G2 with |V(G1)| ≥ |V(G2)|, anchor matching M∗
Result: Set 𝓜 of maximal common induced subgraphs
𝓜 ← ∅; R ← vertices of G2 not part of M∗;
for all subsets S of R in order of increasing cardinality do
    // the iteration over S can be restricted, see text
    G′ ← G2[V(G2) \ S];
    𝓜 ← 𝓜 ∪ VF2_sgi(G1, G′, M∗);
    // VF2_sgi(G1, G′, M∗) returns all embeddings M of G′ in G1 with M∗ ⊆ M, if any
end
return 𝓜
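The trimming strategy can be sketched for very small graphs as follows. This is not the reference implementation: the optimized test VF2_sgi() is replaced here by a brute-force enumeration of induced embeddings, graphs are adjacency dictionaries, vertex labels are ignored, and the search stops at the first cardinality |S| that yields a match. All function names are hypothetical.

```python
from itertools import combinations, permutations


def induced_embeddings(adj1, adj_sub, anchor):
    """Brute-force stand-in for VF2_sgi: all injective maps of the trimmed
    graph into G1 that preserve edges and non-edges and extend the anchor."""
    vertices = sorted(adj_sub)
    free = [v for v in vertices if v not in anchor]
    targets = [u for u in adj1 if u not in anchor.values()]
    found = []
    for image in permutations(targets, len(free)):
        phi = dict(anchor)
        phi.update(zip(free, image))
        if all((phi[b] in adj1[phi[a]]) == (b in adj_sub[a])
               for a, b in combinations(vertices, 2)):
            found.append(phi)
    return found


def iterative_trimming(adj1, adj2, anchor=None):
    """Remove vertex sets S of increasing size from G2 until the trimmed
    graph embeds into G1 as an induced subgraph."""
    anchor = anchor or {}
    removable = [v for v in adj2 if v not in anchor]
    for size in range(len(removable) + 1):
        results = []
        for removed in combinations(removable, size):
            keep = set(adj2) - set(removed)
            trimmed = {v: adj2[v] & keep for v in keep}
            results.extend(induced_embeddings(adj1, trimmed, anchor))
        if results:  # terminate at the first cardinality with a match
            return results
    return []


# G1 is a 4-cycle, G2 is a triangle: the largest common induced subgraph is an edge.
cycle = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
unanchored = iterative_trimming(cycle, triangle)
anchored = iterative_trimming(cycle, triangle, anchor={"a": 0})
```

Fixing the anchor match `a → 0` reduces the number of reported embeddings considerably, which illustrates why anchors shrink the search space.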
Algorithm 2: VF2_step(G1, G2, M)
Data: Graphs G1 and G2, with total order ≤ over V(G2), a matching M ⊆ V(G1) × V(G2)
Result: Set 𝓜 of all matchings between G1 and G2.
Mleaf ← true;
P ← candidate_matches(G1, G2, M);
for (x, y) ∈ P do
    if compatible(M, (x, y)) then
        Mleaf ← false;
        VF2_step(G1, G2, M ∪ {(x, y)});
    end
end
if Mleaf then
    𝓜 ← 𝓜 ∪ {M}
end
Algorithm 3: candidate_matches(G1, G2, M)
Data: Graphs G1, G2 and a matching M ⊂ V(G1) × V(G2)
Result: Set P of match candidates for extending M
N1 ← set of unmatched vertices in V(G1);
N2 ← set of unmatched vertices in V(G2);
// the anchor M∗ is contained in M and would be excluded from the candidates
n2 ← max(0, max{y | (x, y) ∈ M});
P ← {(x, y) | x ∈ N1, y ∈ N2, y > n2};
return P
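The recursion of Algorithms 2 and 3 can be sketched in Python for unlabeled, undirected graphs. This sketch makes several simplifying assumptions: graphs are adjacency dictionaries, the vertices of G2 are integers starting at 0 (so the bound starts at -1 rather than 0), anchors and labels are ignored, and the compatibility test only checks that edges and non-edges agree.

```python
def candidate_matches(adj1, adj2, matching):
    """Pairs (x, y) with x unmatched in G1 and y above the largest matched G2 vertex."""
    matched1 = {x for x, _ in matching}
    last2 = max((y for _, y in matching), default=-1)
    return [(x, y)
            for x in sorted(adj1) if x not in matched1
            for y in sorted(adj2) if y > last2]


def compatible(adj1, adj2, matching, pair):
    """The new match must agree with every old match on edges and non-edges."""
    x, y = pair
    return all((x in adj1[u]) == (y in adj2[v]) for u, v in matching)


def vf2_step(adj1, adj2, matching, found):
    """Extend the matching recursively; record it if no extension exists."""
    is_leaf = True
    for pair in candidate_matches(adj1, adj2, matching):
        if compatible(adj1, adj2, matching, pair):
            is_leaf = False
            vf2_step(adj1, adj2, matching | {pair}, found)
    if is_leaf:
        found.append(matching)


path = {0: {1}, 1: {0, 2}, 2: {1}}
triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}

found_mixed = []
vf2_step(path, triangle, frozenset(), found_mixed)
best_mixed = max(len(m) for m in found_mixed)

found_same = []
vf2_step(path, path, frozenset(), found_same)
best_same = max(len(m) for m in found_same)
```

Each recursive call receives its own extended matching, so sibling branches of the search tree do not interfere with each other. The largest recorded matchings correspond to maximum common induced subgraphs in these two toy cases.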
The original formulation of the VF2 algorithm differs from our adaptation VF2_step()
for MCIS-search by using a different algorithm candidate_matches() for proposing can-
didates for matches that extend M. The version used in VF2 is optimized to determine
whether H is isomorphic to a subgraph of G by restricting the candidates to combinations
of unmatched neighbors y1 of x1 and y2 of x2 for every match x1 x2 , provided such vertices
exist. This restriction, however, makes the assumption that it is possible to extend the
matching M as long as there are still unmatched vertices in connected components that
contain vertices in M. This is true if H is isomorphic to a subgraph of G (or vice versa), but
fails for the MCIS search. A detailed example showing that the standard implementation
of VF2 can fail to find an MCIS is given in Appendix A.2. The candidate search used by
VF2_sgi() inside Iterative_Trimming() is the same as in the original VF2 approach.
a relation that defines when two elementary vertex labels or two elementary edge labels
are compatible. This pairwise compatibility rule can be chosen arbitrarily by the user. For
alignments of alignments, label sets are compatible if and only if all pairs of labels are
compatible, where gaps - are assumed to be compatible with all labels.
In order to keep track of ambiguous edges, it is useful to consider a tri-partition of
V ( G ) × V ( H ) into unambiguous edges E, unambiguous non-edges Q, and ambiguous
edges A. For simplicity, we write x1 x2 for the matching edges.
Since semantic consistency in our setting is defined exclusively in terms of vertex and
edge labels, the semantic filter can, in principle, be integrated with the syntactic consistency
check. The discussion in the theory section immediately implies that a matching M is
consistent if and only if:
(a) the two vertex labels λ( x1 ), λ( x2 ) and the two vertex labels λ(y1 ), λ(y2 ) are compatible
and
(b) for all x1 x2 , y1 y2 ∈ M one of the following conditions is satisfied:
(i) {x1, y1} ∈ E(G) and {x2, y2} ∈ E(H) and the two edge labels λ(x1, y1) and
λ(x2, y2) are compatible, or
(ii) {x1, y1} ∈ Q(G) and {x2, y2} ∈ Q(H), or
(iii) {x1, y1} ∈ A(G) or {x2, y2} ∈ A(H).
Clearly, the three conditions (i), (ii), and (iii) are mutually exclusive. Moreover, if M is
a consistent matching and x1x2 is a candidate for extension, then M ∪ {x1x2} is consistent
if and only if λ(x1) and λ(x2) are compatible and, for every y1y2 ∈ M, one of the
conditions (i), (ii), or (iii) is satisfied. Since each of the three conditions can be checked in
constant time, testing whether M can be extended by x1x2 requires O(|M|) time.
These relationships immediately generalize to directed graphs: compatibility condition
(i) then requires that also the direction of the edges matches, i.e.,
(i’) If x1, y1 and x2, y2 are connected by directed edges, then (x1, y1) ∈ E(G) iff (x2, y2) ∈ E(H),
and the labels of these directed edges, λ(x1, y1) and λ(x2, y2), are compatible.
Ambiguous edges, on the other hand, are consistent with all edge directions and edge
labels, as well as the absence of an edge. For a consistent matching M, the corresponding
graph K(M) and its ambiguous edges are uniquely defined by M as follows: In case (i), we
have {x1x2, y1y2} ∈ E(K(M)). In case (ii), we have {x1x2, y1y2} ∈ Q(K(M)). In case (iii),
we have to distinguish the following cases:
(1) If {x1, y1} ∈ A(G) and {x2, y2} ∈ E(H), or {x1, y1} ∈ E(G) and {x2, y2} ∈ A(H),
then {x1x2, y1y2} ∈ E(K(M));
(2) If {x1, y1} ∈ A(G) and {x2, y2} ∈ Q(H), or {x1, y1} ∈ Q(G) and {x2, y2} ∈ A(H),
then {x1x2, y1y2} ∈ Q(K(M));
(3) If {x1, y1} ∈ A(G) and {x2, y2} ∈ A(H), then {x1x2, y1y2} ∈ A(K(M)).
Again, K(M ∪ {x1x2}) can be constructed from K(M) with O(|M|) effort. The
generalization to directed graphs is straightforward: In cases (2) and (3), the direction of
the edge(s) in the alignment is defined by the edges in either E( G ) or E( H ), respectively.
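The case distinction for undirected graphs can be condensed into a small lookup. In the sketch below, the status of a vertex pair is one of the strings "E" (edge), "Q" (non-edge), or "A" (ambiguous); the function names and the status encoding are assumptions made for illustration, and label compatibility is passed in as a precomputed Boolean.

```python
E, Q, A = "E", "Q", "A"


def combine(status_g, status_h, labels_compatible=True):
    """Status of the pair {x1x2, y1y2} in K(M), or None if inconsistent."""
    if status_g == A and status_h == A:
        return A                       # case (3)
    if status_g == A:
        return status_h                # cases (1) and (2)
    if status_h == A:
        return status_g                # cases (1) and (2)
    if status_g == E and status_h == E:
        return E if labels_compatible else None  # condition (i)
    if status_g == Q and status_h == Q:
        return Q                       # condition (ii)
    return None                        # edge against non-edge


def can_extend(matching, new_match, status_in_g, status_in_h):
    """Check the new match against every old match: O(|M|) constant-time tests."""
    x1, x2 = new_match
    return all(
        combine(status_in_g(x1, y1), status_in_h(x2, y2)) is not None
        for y1, y2 in matching
    )


# Toy example: G has the edge a-b, H has the non-edge u-v and an ambiguous pair u-w.
def status_in_g(x, y):
    return E


def status_in_h(x, y):
    return A if {x, y} == {"u", "w"} else Q
```

For instance, `can_extend([("a", "u")], ("b", "w"), status_in_g, status_in_h)` succeeds because the ambiguous pair in H is consistent with the edge in G, whereas matching `b` with `v` fails.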
Since we store the binary vector f ( x ) for each alignment column x, it is not necessary
to explicitly store the sets E, Q, and A. Instead, it suffices to store the unambiguous edges
E and to use ( x, y) ∈ A if and only if f ( x ) ⊙ f (y) = 0, where ⊙ denotes component-wise
binary multiplication.
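The test for ambiguity from the presence vectors is a one-liner. The sketch below assumes that f(x) is stored as a tuple of 0/1 entries, one per input graph; the function names are hypothetical.

```python
def componentwise_product(fx, fy):
    """f(x) ⊙ f(y) for two presence vectors of equal length."""
    return tuple(a * b for a, b in zip(fx, fy))


def is_ambiguous(fx, fy):
    """A pair of columns is ambiguous if no input graph contains both."""
    return not any(componentwise_product(fx, fy))
```

For example, columns with presence vectors (1, 1, 0) and (0, 0, 1) have complementary gap patterns, so no input graph can say whether an edge between them exists, while (1, 1, 0) and (0, 1, 1) share the second input graph.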
protein structures [26]. Moreover, this kind of matrix representation contains no information
of the graph structure.
Figure 3. Visualization of a graph alignment of three input graphs. Top: matrix representation
highlighting presence/absence of input graphs in the alignment columns (rows). Vertex-labels may
be represented by colors. Middle: each of the three input graphs is shown embedded in the alignment
graph. Matched vertices are shown in corresponding positions and with corresponding colors.
Bottom: graph of the multiple graph alignment.
Alternatively, one starts from a planar layout of the alignment graph G. Since every
input graph Gi is an induced subgraph of G, an embedding of Gi can be obtained by
retaining only vertices and edges of Gi . As shown in Figure 3, spatial correspondence of
the layouts emphasizes the embedding of Gi in G.
Still, this representation is not always easy to read. An improvement is obtained by
superimposing the embeddings of the input graphs Gi in stacked planes above the embed-
ding of G such that each set of aligned vertices lies on a common vertical line (Figure 4).
Coloring the vertices depending on their position in G further improves readability.
4. Computational Results
4.1. Implementation
As a proof of concept, we developed the Progressive Graph Alignment toolkit
ProGrAlign in Python, making extensive use of NetworkX [46], a library that is
widely used for graph analysis. ProGrAlign is publicly available in a GitHub repository [47].
It consists of three software tools: ProGrAlign_Analysis computes the progressive graph
alignment using a list of graphs built as NetworkX objects. The graphs can be undirected,
directed, labeled or weighted, and may have loops. It is necessary that either all input
graphs are undirected or all input graphs are directed. Functions to convert undirected
graphs into symmetric directed graphs, or directed graphs into their underlying undirected
graphs, are readily available in NetworkX. Input and output use the binary files produced
with the Pickle [48] standard package for the serialization of Python objects. Additionally,
two visualization tools are provided, ProGrAlign_Vis2D and ProGrAlign_Vis3D, which
make use of the Matplotlib [49] library to display the alignment graph.
The mutation operator acting over a connected graph G first deletes one or two vertices,
such that G remains connected, and then adds one or two vertices, where for
each of them the number of inserted edges is chosen uniformly from the degree sequence of G. If two
vertices are added, one may be chosen as a neighbor of the other during edge addition. The
creation of loops is not allowed. Random labels are then associated with the new vertices
and new edges.
The set of test graphs is produced by applying the mutation operator along a planted
binary tree rooted at G0. The single child G1 of G0 has two children G2 and G3,
and four grandchildren G4 to G7 . Again, we exclude trivial examples by requiring con-
nectedness and removing graphs that by chance are isomorphic to a previously generated
one. Standard functions provided by NetworkX are used for this purpose. Starting with
16 vertices at G0 , the graphs Gi , 1 ≤ i ≤ 7 have between 13 and 19 vertices. All tests are
performed with 50 independent sets of test graphs.
[Figure 5 plots: rows VF2_step and Iterative_Trimming, panels (a), (b), (c); x-axis: voting threshold (1/7 to 7/7); y-axes: average order in two panels and MCIS-distance in one.]
Figure 5. Consensus graphs as a function of the voting threshold α. (a) The size of the consensus, as
expected, decreases monotonically with increasing fraction α of non-gaps in the retained alignment
columns. (b) Correspondingly, the fraction of the vertices of the input graphs that are retained in
the consensus decreases. (c) The distance between the reference graph G0 and the alignment of its
noisy offspring G1 through G7, however, reaches a minimum when only the columns that contain
less than 50% gaps are used to form the consensus. In each panel, we compare guide trees computed
with kernel-based similarity (full lines), random guide trees without considering ambiguous edges
(dashed lines), and random guide trees taking ambiguous edges into account (dotted lines). The
latter are nearly indistinguishable from the kernel-based guide trees. Here, line colors are a visual aid
to distinguish the methodologies and their variations outlined before.
[Figure 6: plot of the number of ambiguous edges (vertical axis) against the number of gaps (horizontal axis); Pearson correlation = 0.99.]
Figure 6. Correlation of the number of ambiguous edges with the number of gaps. The values here
correspond to the 400 alignments resulting from running the 8 experiments, described in Figure 7,
over our set of 50 scenarios.
Algorithms 2024, 17, 116 16 of 23
[Figure 7: running times of the eight experiments; horizontal axis: Experiments, labeled Tk, Tk, Tr, Tr, Sk, Sk, Sr, Sr.]
Figure 7. Comparison of the running times [s] of the eight experiments, each carried out over the 50 scenarios:
T and S refer to the use of Iterative_Trimming and VF2_step, respectively. Kernel-based (k) and
random (r) guide trees are compared, showing a moderate but systematic advantage of the kernel-based similarity. The
exclusion (⃝) or inclusion (■) of ambiguous edges is also compared.
5. Concluding Remarks
Progressive alignments of seven graphs with 16–19 vertices can be computed in about
10 s using the prototype implementation ProGrAlign. On our test sets, consensus graphs,
defined to contain alignment columns in which at least half of the input graphs are included,
are very good approximations of the reference graph G0. A difference as low as 3.5 is
comparable to the smallest differences (2–4, see Figure A1 in Appendix A.1) between G0
and its variations Gi , 1 ≤ i ≤ 7, in each of the 50 scenarios. Discrepancies between methods
are small, suggesting that solutions are close to optimal.
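The voting rule behind these consensus graphs can be sketched as follows, assuming a hypothetical representation in which each vertex of the alignment graph carries a `rows` attribute: a tuple with one entry per input graph, holding the aligned vertex or `None` for a gap. Edges between retained columns are simply inherited here; any additional voting on edges is not covered by this sketch.

```python
from fractions import Fraction
import networkx as nx

def consensus(alignment, alpha):
    """Keep the columns whose fraction of non-gap entries is at least alpha."""
    keep = []
    for column, data in alignment.nodes(data=True):
        rows = data["rows"]
        non_gaps = sum(entry is not None for entry in rows)
        if Fraction(non_gaps, len(rows)) >= alpha:
            keep.append(column)
    # Induced subgraph on the retained columns, as an independent copy.
    return alignment.subgraph(keep).copy()

# Toy alignment of three input graphs with three columns.
A = nx.Graph()
A.add_node("c0", rows=("a", "x", 1))
A.add_node("c1", rows=("b", None, 2))
A.add_node("c2", rows=(None, None, 3))
A.add_edges_from([("c0", "c1"), ("c1", "c2")])

majority = consensus(A, Fraction(1, 2))  # columns present in at least half of the inputs
strict = consensus(A, Fraction(1))       # gap-free columns only
```

Exact fractions avoid rounding problems at thresholds such as 4/7, which matter when the threshold coincides with an attainable column occupancy.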
The graph alignments considered here differ from related concepts by explicitly consid-
ering alignments as super-graphs. In contrast, related work considers only correspondences
of vertices, i.e., “alignment columns”, without specifying edges and without strictly
enforcing the conservation of adjacency. The main advantage of considering alignments as
graphs is the resulting “compositionality”, which makes it possible to build up multiple
alignments in a step-wise fashion. The property that every input graph is recovered through
a well-defined projection operation, furthermore, makes MGAs a proper generalization of
multiple sequence alignments and the ordered forest alignments used for comparing RNA
structures. Vertex matching procedures do not share these properties.
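The projection operation can be illustrated with a short sketch. It assumes a hypothetical representation in which every vertex of the alignment graph is a column carrying a `rows` attribute, i.e., a tuple with one entry per input graph that holds the aligned vertex or `None` for a gap. Under this assumption, the projection onto input i is the subgraph induced by the columns that are not gaps in row i, relabeled by the vertices of that row.

```python
import networkx as nx

def project(alignment, i):
    """Recover input graph i from an alignment graph (sketch)."""
    # Map every column that is not a gap in row i to the aligned vertex.
    present = {
        column: data["rows"][i]
        for column, data in alignment.nodes(data=True)
        if data["rows"][i] is not None
    }
    induced = alignment.subgraph(present).copy()
    return nx.relabel_nodes(induced, present, copy=True)

# Alignment of the path a-b-c (row 0) with the single edge x-y (row 1).
A = nx.Graph()
A.add_node("c0", rows=("a", "x"))
A.add_node("c1", rows=("b", "y"))
A.add_node("c2", rows=("c", None))
A.add_edges_from([("c0", "c1"), ("c1", "c2")])

first = project(A, 0)
second = project(A, 1)
```

In this toy case `first` is again the path a-b-c and `second` is the single edge x-y, so both inputs are recovered from the same alignment graph.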
An important concept that arises from the idea that alignments of graphs are again
graphs is that of ambiguous edges, i.e., edges in an alignment graph that
never appear in the projection of any of the input graphs. Note, however, that ambiguous
edges may be unambiguously preserved in the restriction of an alignment graph to a
subset S of rows: this is the case if neither of its incident vertices is reduced to an all-
gap column in the restriction to S. Ambiguous edges are analogs of the ambiguity in
the relative order of insertions and deletions between two consecutive match columns in
sequence alignments. Sequence alignments can be improved by realigning such regions
with the aim of “harmonizing” indel patterns [51]. From a theoretical point of view, such
ambiguities have been studied with regard to their handling in dynamic programming
algorithms [52]. In the case of (progressive) graph alignments, the analogous ambiguities
are at least alleviated by considering ambiguous edges.
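To make the definition concrete, the sketch below flags ambiguous edges in an alignment graph. It assumes a hypothetical representation in which each column carries a `rows` tuple with the aligned vertex per input graph, or `None` for a gap. An edge is then ambiguous exactly when no row contains both of its endpoints, i.e., when the two columns have complementary gap patterns. The second function encodes the restriction criterion stated above.

```python
import networkx as nx

def is_ambiguous(alignment, u, v):
    """True if no input graph contains both endpoints of the edge {u, v}."""
    rows_u = alignment.nodes[u]["rows"]
    rows_v = alignment.nodes[v]["rows"]
    return all(a is None or b is None for a, b in zip(rows_u, rows_v))

def ambiguous_edges(alignment):
    """List all ambiguous edges of the alignment graph."""
    return [(u, v) for u, v in alignment.edges if is_ambiguous(alignment, u, v)]

def survives_restriction(alignment, u, v, subset):
    """True if neither endpoint becomes an all-gap column on the rows in subset."""
    def has_entry(column):
        rows = alignment.nodes[column]["rows"]
        return any(rows[i] is not None for i in subset)
    return has_entry(u) and has_entry(v)

# Three input graphs; c1 and c2 never occur in the same row.
A = nx.Graph()
A.add_node("c0", rows=("a", "x", "p"))
A.add_node("c1", rows=("b", None, None))
A.add_node("c2", rows=(None, "y", None))
A.add_edges_from([("c0", "c1"), ("c0", "c2"), ("c1", "c2")])

flagged = ambiguous_edges(A)
```

Here only the edge between `c1` and `c2` is flagged; both of its endpoints keep an entry in the restriction to rows 0 and 1, but not in the restriction to row 2 alone.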
We emphasize that MGAs are not intended as a replacement for vertex-matching
procedures. In applications that require only local similarities of subgraphs, the stringent
definition of MGAs is usually not necessary and arguably computationally too expensive.
Well-defined consensus graphs, on the other hand, require that the MGA itself is a
well-defined graph. We expect that the framework will prove useful in particular in applications
to evolutionary biology, where alignments are used because they not only represent
similarity but also convey a notion of evolutionary relatedness.
The implementation of the MGA described here is intended as a proof-of-concept,
showing the feasibility and potential usefulness of the concept, which has been introduced
from a purely theoretical perspective in [5]. Alignments of moderate-size and small graphs
are, in particular, appealing for applications to molecular graphs. For these, atom labels and
bond orders impose strong semantic constraints on the matches, which limit the computa-
tional efforts. For larger-scale applications, however, further algorithmic improvements
will be necessary.
In principle, it is possible to endow VF2_step() with a bound on the number of
possible extending matches that reduces the number of branches of the search tree, akin to the
McSplit [40] algorithm. However, the evaluation of such bounds must be computationally
cheap because the bound is evaluated in every state of the search space. While this seems
possible provided the compatibility tests set by the user allow only matches of vertices
(and edges) having the same label, it does not seem to be an easy task when optimizing
more general scoring schemes or broader compatibility rules. Moreover, ambiguous edges
may lead to very loose bounds that are of little practical use. A simple implementation
of the bound used in McSplit did not speed up our program. It remains
to be determined whether this is a consequence of the extensive set of candidates for
extending the match set M, or whether this is an issue arising from inadequate data structures.
An effective branch-and-bound algorithm for graph alignments thus remains a topic for
ongoing research.
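For the simple case in which only vertices with identical labels may be matched, a bound of this kind can be sketched as follows. This is a coarse simplification, not the bound implemented in McSplit or tested in ProGrAlign: McSplit additionally refines the label classes by adjacency to already matched vertices. The function names are hypothetical.

```python
from collections import Counter

def label_class_bound(current_size, unmatched_g, unmatched_h):
    """Upper bound on the size of any match set extending the current one.

    unmatched_g and unmatched_h are the labels of the still unmatched
    vertices of the two graphs. Within one label class, at most
    min(#vertices in G, #vertices in H) further matches are possible.
    """
    count_g = Counter(unmatched_g)
    count_h = Counter(unmatched_h)
    extra = sum(min(n, count_h[label]) for label, n in count_g.items())
    return current_size + extra

def should_prune(current_size, unmatched_g, unmatched_h, best_size):
    """Prune a search state that cannot beat the best match set found so far."""
    return label_class_bound(current_size, unmatched_g, unmatched_h) <= best_size

# Three vertices matched so far; labels of the remaining vertices:
bound = label_class_bound(3, ["C", "C", "N", "O"], ["C", "O", "O", "S"])
```

In the example, one further C and one further O can be matched at best, so `bound` equals 5, and a branch with this state is discarded as soon as a match set of size 5 is already known. Counting labels takes time linear in the number of unmatched vertices, which illustrates why the cost of evaluating the bound in every search state matters.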
Although MGAs provide consensus graphs that do not grow in size, the alignment
graph itself may grow linearly in size with the number of input graphs, in particular for
pairwise dissimilar input graphs. It seems possible to speed up alignments by restricting
the early match sets M to columns with few gaps and thus large scores. Subgraph-based
kernels might also be used to find matches between vertices that are likely contained in
maximal common induced subgraphs with large scores.
Author Contributions: M.E.G.L. developed the algorithms, implemented the software, and per-
formed the computational analysis. P.F.S. designed the study. Both authors contributed to the
theoretical results and the writing of the manuscript. All authors have read and agreed to the
published version of the manuscript.
Funding: The authors acknowledge financial support by the Federal Ministry of Education and Re-
search of Germany and by the Sächsische Staatsministerium für Wissenschaft Kultur und Tourismus
in the program Center of Excellence for AI-research “Center for Scalable Data Analytics and Artificial
Intelligence Dresden/Leipzig”, project identification number: ScaDS.AI. PFS acknowledges support
by the German Federal Ministry of Education and Research BMBF through DAAD project 57616814
(SECAI, School of Embedded Composite AI). Publication costs were covered by the Open Access
Publishing Fund of Leipzig University supported by the German Research Foundation within the
program Open Access Publication Funding.
Data Availability Statement: The software package as well as scripts to re-run the validation
experiments are available on a Github repository [47].
Acknowledgments: We thank Maria Waldl and Jakob Lykke Andersen for their valuable comments
on the manuscript.
Conflicts of Interest: The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
Appendix A
Appendix A.1. Additional Information on Graph Similarities
The mutation test graphs were used to investigate the correlation between kernel-based and
MCIS-based distance measures. Empirically, we find a good correlation indicating that
the kernel-based distance, which can be evaluated much more efficiently, is a decent
approximation for the purpose of constructing the guide tree (Figure A1).
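As an illustration of how a distance is obtained from a kernel, the sketch below uses the standard kernel-induced distance d(G, H) = sqrt(k(G, G) + k(H, H) − 2 k(G, H)). The vertex-label histogram kernel is only a stand-in chosen for brevity and is an assumption of this example; it is not the kernel used for the guide trees in the experiments, and any normalization applied there is not reproduced.

```python
import math
from collections import Counter
import networkx as nx

def label_histogram_kernel(G, H, attr="label"):
    """Toy kernel: inner product of the vertex-label count vectors."""
    count_g = Counter(nx.get_node_attributes(G, attr).values())
    count_h = Counter(nx.get_node_attributes(H, attr).values())
    return sum(n * count_h[label] for label, n in count_g.items())

def kernel_distance(G, H, kernel):
    """Distance induced by a positive semi-definite kernel."""
    squared = kernel(G, G) + kernel(H, H) - 2 * kernel(G, H)
    return math.sqrt(max(squared, 0.0))  # guard against rounding below zero

G = nx.path_graph(3)
nx.set_node_attributes(G, {0: "A", 1: "A", 2: "B"}, "label")
H = nx.path_graph(3)
nx.set_node_attributes(H, {0: "A", 1: "B", 2: "B"}, "label")

d = kernel_distance(G, H, label_histogram_kernel)
```

Evaluating such a kernel requires only counting, whereas an MCIS-based distance requires solving an MCIS problem for every pair of graphs, which is the efficiency argument made above.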
In order to better characterize the set of test graphs, we show the distribution of
MCIS-sizes for pairs of graphs. Typically, the MCIS covers more than half of both graphs,
explaining why Iterative_Trimming() is more efficient than VF2_step() on this data set.
[Figure A1: left panels plot the MCIS-based distance against the kernel-based distance with a linear regression (top: G0 vs Gi; bottom: Gi vs Gj, distance between pairs of mutants, Pearson correlation = 0.75946). Right panels show the distribution of the order of the MCIS (top: between G0 and mutants; bottom: between pairs of mutants, mean: 13.13 vertices).]
Figure A1. Left: correlation between kernel distance and MCIS-based distance. Right: distribution of
|MCIS|. The panels on the top compare the initial graph to its mutants. The panels below compare
pairs of mutants over each of the 50 scenarios.
[Figure A2: drawings of two graphs, G1 with vertices a to h and G2 with vertices 1 to 8.]
Figure A2. Examples of a graph with given vertex order for which the standard choice of candidates,
i.e., candidate_matches_original, does not result in the correct MCIS.
Similar examples show that other strategies such as “taking the feasible pair from the
minimum” instead of the “minimum pair from the feasible” also fail for certain inputs. In
general, the difference is that VF2 assumes that, if the graphs are indeed isomorphic, there
must be at least one match mapping every vertex in these graphs, a condition that is not
compatible with the MCIS search.
[Figure A3: histograms of the number of scenarios (vertical axis) against the number of ambiguous edges in alignments (horizontal axis), obtained with (a) kernel-based guide trees (GT) and (b) random guide trees.]
Figure A3. Distribution of ambiguous edges present in (a) alignments obtained when using the
kernel-based guide trees, and (b) when using the random guide trees.
[Figure A4: histograms of the number of scenarios (vertical axis) against the running time [s] (horizontal axis) of (a) Iterative_Trimming and (b) VF2_step, with kernel-based GT and without ambiguous edges.]
Figure A4. Running times of both algorithms, (a) Iterative_Trimming and (b) VF2_step, tested
together with kernel-based guide trees but ignoring ambiguous edges.
[Figure A5: panels (a) and (b). Labels recovered: Intervals of proportion of edges, from (25, 30) to (65, 70); Order of MCIS between subsets of random graphs.]
Figure A5. Order of MCIS between random graphs generated by pairs of order and proportion of
edges, chosen uniformly and labeled uniformly at random. (a) Shows average variations against each
variable, while (b) shows the variations against the combination of these parameters.
In Figure A6, we show the running time taken by each algorithm to complete the
MCIS-search in this data set. Specifically, Figure A6(a1,a2) shows the running time of
Iterative_Trimming, while Figure A6(b1,b2) shows that of VF2_step. The running time
of both algorithms grows with the order of the MCIS. However, the running time of VF2_step
appears to decrease as the proportion of edges increases. Since the running time of VF2_step must
increase with the number of edges of a graph, this behavior is better explained by
the correlation between the running time of this routine and the order of the MCIS
it has to uncover, shown in Figure A5b. This suggests that VF2_step is better suited for
detecting MCISs that are comparatively smaller than the graphs containing them.
[Figure A6: panels (a1), (a2) for Iterative_Trimming and (b1), (b2) for VF2_step. Axis labels recovered: Order (8 to 16); Intervals of proportion of edges, from (25, 30) to (65, 70); Average running time [s]; Running time [s].]
Figure A6. Running time of MCIS-search over connected random graphs. Top corresponds to
Iterative_Trimming and bottom to VF2_step. Again (a1,b1) show average variations against each
variable, while (a2,b2) show the variations against pairs of order and proportion of edges.
References
1. Rosenberg, M.S. (Ed.) Sequence alignment: Concepts and history. In Sequence Alignment: Methods, Models, Concepts, and Strategies;
University of California Press: Oakland, CA, USA, 2009; pp. 1–22. [CrossRef]
2. Chatzou, M.; Magis, C.; Chang, J.M.; Kemena, C.; Bussotti, G.; Erb, I.; Notredame, C. Multiple sequence alignment modeling:
Methods and applications. Brief. Bioinform. 2015, 17, 1009–1023. [CrossRef]
3. Jiang, T.; Wang, L.; Zhang, K. Alignment of trees—An alternative to tree edit. Theor. Comput. Sci. 1995, 143, 137–148. [CrossRef]
4. Höchsmann, M.; Voss, B.; Giegerich, R. Pure multiple RNA secondary structure alignments: A progressive profile approach.
Trans. Comput. Biol. Bioinform. 2004, 1, 53–62. [CrossRef]
5. Berkemer, S.; Höner zu Siederdissen, C.; Stadler, P.F. Compositional properties of alignments. Math. Comput. Sci. 2021, 15, 609–630.
[CrossRef]
6. Berg, J.; Lässig, M. Local graph alignment and motif search in biological networks. Proc. Natl. Acad. Sci. USA 2004,
101, 14689–14694. [CrossRef]
7. Kuchaiev, O.; Milenković, T.; Memisević, V.; Hayes, W.; Pržulj, N. Topological network alignment uncovers biological function
and phylogeny. J. R. Soc. Interface 2010, 7, 1341–1354. [CrossRef]
8. Mernberger, M.; Klebe, G.; Hüllermeier, E. SEGA: Semiglobal graph alignment for structure-based protein comparison. IEEE/ACM
Trans. Comput. Biol. Bioinform. 2011, 8, 1330–1343. [CrossRef] [PubMed]
9. Weskamp, N.; Hüllermeier, E.; Kuhn, D.; Klebe, G. Multiple graph alignment for the structural analysis of protein active sites.
IEEE/ACM Trans. Comput. Biol. Bioinform. 2007, 4, 310–320. [CrossRef] [PubMed]
10. Singh, R.; Xu, J.X.; Berger, B. Global alignment of multiple protein interaction networks with application to functional orthology
detection. Proc. Natl. Acad. Sci. USA 2008, 105, 12763–12768. [CrossRef]
11. Zhang, S.; Tong, H. FINAL: Fast attributed network alignment. In Proceedings of the KDD ’16: Proceedings of the 22nd ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016;
Krishnapuram, B., Shah, M., Smola, A., Aggarwal, C., Shen, D., Eds.; Association for Computing Machinery: New York, NY, USA,
2016; pp. 1345–1354. [CrossRef]
12. Heimann, M.; Lee, W.; Pan, S.; Chen, K.Y.; Koutra, D. HashAlign: Hash-based alignment of multiple graphs. In Proceedings of
the Advances in Knowledge Discovery and Data Mining, PAKDD 2018, Melbourne, VIC, Australia, 3–6 June 2018; Phung, D.,
Tseng, V., Webb, G., Ho, B., Ganji, M., Rashidi, L., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2018;
Volume 10939, pp. 726–739. [CrossRef]
13. Bayati, M.; Gleich, D.F.; Saberi, A.; Wang, Y. Message-passing algorithms for sparse network alignment. ACM Trans. Knowl.
Discov. Data 2013, 7, 3. [CrossRef]
14. Tang, J.; Zhang, W.; Li, J.; Zhao, K.; Tsung, F.; Li, J. Robust attributed graph alignment via joint structure learning and optimal
transport. In Proceedings of the IEEE 39th International Conference on Data Engineering (ICDE), Los Alamitos, CA, USA, 3–7
April 2023; pp. 1638–1651. [CrossRef]
15. Malmi, E.; Chawla, S.; Gionis, A. Lagrangian relaxations for multiple network alignment. Data Min. Knowl. Discov. 2017,
31, 1331–1358. [CrossRef]
16. Feng, D.F.; Doolittle, R.F. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 1987,
25, 351–360. [CrossRef]
17. Wang, L.; Jiang, T. On the complexity of multiple sequence alignment. J. Comput. Biol. 1994, 1, 337–348. [CrossRef]
18. Just, W. Computational complexity of multiple sequence alignment with SP-score. J. Comput. Biol. 2001, 8, 615–623. [CrossRef]
19. Elias, I. Settling the intractability of multiple alignment. J. Comput. Biol. 2006, 13, 1323–1339. [CrossRef]
20. Fober, T.; Mernberger, M.; Klebe, G.; Hüllermeier, E. Evolutionary construction of multiple graph alignments for the structural
analysis of biomolecules. Bioinformatics 2009, 25, 2110–2117. [CrossRef]
21. Ngoc, H.T.; Duc, D.D.; Xuan, H.H. A novel ant based algorithm for multiple graph alignment. In Proceedings of the International
Conference on Advanced Technologies for Communications (ATC 2014), Hanoi, Vietnam, 15–17 October 2014; Heath, R.W.,
Quynh, N.X., Lap, L.H., Eds.; IEEE Press: Piscataway, NJ, USA, 2014; pp. 181–186. [CrossRef]
22. Sokal, R.R.; Michener, C.D. A statistical method for evaluating systematic relationships. Univ. Kansas Sci. Bull. 1958, 38, 1409–1438.
23. Cordella, L.P.; Foggia, P.; Sansone, C.; Vento, M. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Trans.
Pattern Anal. Mach. Intell. 2004, 26, 1367–1372. [CrossRef] [PubMed]
24. Alpár Jüttner, A.; Madarasi, P. VF2++—An improved subgraph isomorphism algorithm. Discret. Appl. Math. 2018, 242, 69–81.
[CrossRef]
25. Touzet, H. Comparing similar ordered trees in linear-time. J. Discret. Algorithms 2007, 5, 696–705. [CrossRef]
26. Stadler, P.F. Alignments of biomolecular contact maps. Interface Focus 2021, 11, 20200066. [CrossRef] [PubMed]
27. Morgenstern, B.; Stoye, J.; Dress, A.W.M. Consistent Equivalence Relations: A Set-Theoretical Framework for Multiple Sequence
Alignments; Technical Report; University of Bielefeld, FSPM: Bielefeld, Germany, 1999. [CrossRef]
28. Nelesen, S.; Liu, K.; Zhao, D.; Linder, C.R.; Warnow, T. The effect of the guide tree on multiple sequence alignments and
subsequent phylogenetic analyses. In Pacific Symposium on Biocomputing PSB’08; Altman, R.B., Dunker, A.K., Hunter, L., Klein,
T.E., Eds.; Stanford Univ.: Stanford, CA, USA, 2008; pp. 25–36. [CrossRef]
29. Zhan, Q.; Ye, Y.; Lam, T.W.; Yiu, S.M.; Wang, Y.; Ting, H.F. Improving multiple sequence alignment by using better guide trees.
BMC Bioinform. 2015, 16, S4. [CrossRef] [PubMed]
30. Bunke, H. On a relation between graph edit distance and maximum common subgraph. Pattern Recognit. Lett. 1997, 18, 689–694.
[CrossRef]
31. Zeng, Z.; Tung, A.K.H.; Wang, J.; Feng, J.; Zhou, L. Comparing stars: On approximating graph edit distance. Proc. VLDB Endow.
2009, 2, 25–36. [CrossRef]
32. Jia, L.; Gaüzère, B.; Honeine, P. Graph kernels based on linear patterns: Theoretical and experimental comparisons. Expert Syst.
Appl. 2022, 189, 116095. [CrossRef]
33. Schölkopf, B. The kernel trick for distances. In Proceedings of the NIPS’00: Proceedings of the 13th International Conference on
Neural Information Processing Systems, Denver, CO, USA, 1 January 2000; Leen, T., Dietterich, T., Tresp, V., Eds.; MIT Press:
Cambridge, MA, USA, 2000; pp. 283–289. [CrossRef]
34. Phillips, J.M.; Venkatasubramanian, S. A gentle introduction to the kernel distance. arXiv 2011, arXiv:1103.1625.
35. Kriege, N.M.; Johansson, F.D.; Morris, C. A survey on graph kernels. Appl. Netw. Sci. 2020, 5, 6. [CrossRef]
36. Jia, L.; Gaüzère, B.; Honeine, P. Graphkit-learn: A python library for graph kernels based on linear patterns. Pattern Recognit. Lett.
2021, 143, 113–121. [CrossRef]
37. Garey, M.R.; Johnson, D.S. Computers and Intractability. A Guide to the Theory of N P Completeness; Freeman: San Francisco, CA,
USA, 1979.
38. Kann, V. On the approximability of the maximum common subgraph problem. In Proceedings of the 9th Annual Symposium on
Theoretical Aspects of Computer Science; Cachan, France, 13–15 February 1992; Finkel, A., Jantzen, M., Eds.; Lecture Notes in
Computer Science; Springer: Berlin/Heidelberg, Germany, 1992; Volume 577, pp. 375–388. [CrossRef]
39. Barrow, H.; Burstall, R. Subgraph isomorphism, matching relational structures and maximal cliques. Inf. Process. Lett. 1976,
4, 83–84. [CrossRef]
40. McCreesh, C.; Prosser, P.; Trimble, J. A partitioning algorithm for maximum common subgraph problems. In Proceedings of the
Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, Melbourne, Australia, 19–25 August 2017; Sierra,
C., Ed.; AAAI Press: Palo Alto, CA, USA, 2017; pp. 712–719. [CrossRef]
41. Hoffmann, R.; McCreesh, C.; Reilly, C. Between subgraph isomorphism and maximum common subgraph. In Proceedings of the
Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Markovitch, S., Singh, S.,
Eds.; AAAI Press: Palo Alto, CA, USA, 2017; Volume 1, pp. 3907–3914. [CrossRef]
42. Liu, Y.; Zhao, J.; Li, C.M.; Jiang, H.; He, K. Hybrid learning with new value function for the maximum common induced subgraph
problem. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI-37), Washington, DC, USA,
7–14 February 2023; Williams, B., Chen, Y., Neville, J., Eds.; AAAI Press: Palo Alto, CA, USA, 2023; Volume 4, pp. 4044–4051.
43. Berezikov, E.; Guryev, V.; Plasterk, R.H.A.; Cuppen, E. CONREAL: Conserved regulatory elements anchored alignment algorithm
for identification of transcription factor binding sites by phylogenetic footprinting. Genome Res. 2004, 14, 170–178. [CrossRef]
44. Morgenstern, B.; Prohaska, S.J.; Pohler, D.; Stadler, P.F. Multiple sequence alignment with user-defined anchor points. Algorithms
Mol. Biol. 2006, 1, 6. [CrossRef]
45. Brun, L.; Gaüzère, B.; Fourey, S. Relationships between Graph Edit Distance and Maximal Common Unlabeled Subgraph; Technical
Report hal-00714879; HAL: Bangalore, India, 2012.
46. Hagberg, A.A.; Schult, D.A.; Swart, P.J. Exploring network structure, dynamics, and function using NetworkX. In Proceedings of
the 7th Python in Science Conference, Pasadena, CA, USA, 19–24 August 2008; Varoquaux, G., Vaught, T., Millman, J., Eds.; 2008;
pp. 11–15.
47. González-Laffitte, M.E.; Stadler, P.F. GitHub Repository of the Progressive Graph Alignment Software ProGrAlign. 2024. Available
online: https://github.com/MarcosLaffitte/Progralign (accessed on 23 February 2024).
48. Documentation on the Pickle Python Package. Available online: https://docs.python.org/3/library/pickle.html (accessed on 1
March 2024).
49. Hunter, J.D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 2007, 9, 90–95. [CrossRef]
50. Schneider, T.D. Consensus sequence zen. Appl. Bioinform. 2002, 1, 111–119.
51. Hagiwara, K.; Edmonson, M.N.; Wheeler, D.A.; Zhang, J. indelPost: Harmonizing ambiguities in simple and complex indel
alignments. Bioinformatics 2022, 38, 549–551. [CrossRef] [PubMed]
52. Giegerich, R. Explaining and controlling ambiguity in dynamic programming. In Proceedings of the Combinatorial Pattern
Matching. CPM’00, Montreal, QC, Canada, 21–23 June 2000; Giancarlo, R., Sankoff, D., Eds.; Lecture Notes in Computer Science;
Springer: Berlin/Heidelberg, Germany, 2000; Volume 1848. [CrossRef]
53. Wallace, I.M.; O’Sullivan, O.; Higgins, D.G. Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics 2005,
21, 1408–1414. [CrossRef] [PubMed]
54. Sze, S.H.; Lu, Y.; Wang, Q.W. A polynomial time solvable formulation of multiple sequence alignment. J. Comput. Biol. 2006,
13, 309–319. [CrossRef]