Efficient K-Clique Counting On Large Graphs The Power of Color-Based Sampling Approaches
This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2023.3314643
Abstract—K -clique counting is a fundamental problem in network analysis which has attracted much attention in recent years.
Computing the count of k-cliques in a graph for a large k (e.g., k = 8) is often intractable as the number of k-cliques increases
exponentially w.r.t. (with respect to) k. Existing exact k-clique counting algorithms often struggle to handle large dense graphs, while
sampling-based solutions either require a huge number of samples or consume very high storage space to achieve a satisfactory
accuracy. To overcome these limitations, we propose a new framework to estimate the number of k-cliques which integrates both the
exact k-clique counting technique and three novel color-based sampling techniques. The key insight of our framework is that we only
apply the exact algorithm to compute the k-clique counts in the sparse regions of a graph, and use the proposed color-based sampling
approaches to estimate the number of k-cliques in the dense regions of the graph. Specifically, we develop three novel dynamic
programming based k-color set sampling techniques to efficiently estimate the k-clique counts, where a k-color set contains k nodes
with k different colors. Since a k-color set is often a good approximation of a k-clique in the dense regions of a graph, our
sampling-based solutions are extremely efficient and accurate. Moreover, the proposed sampling techniques are space efficient, using
near-linear space w.r.t. the graph size. We conduct extensive experiments to evaluate our algorithms using 8 real-life graphs. The
results show that our best algorithm is at least one order of magnitude faster than the state-of-the-art sampling-based solutions (with
the same relative error 0.1%) and can be up to three orders of magnitude faster than the state-of-the-art exact algorithm on large
graphs.
Index Terms—k-clique counting, Cohesive subgraphs, Graph coloring, Graph sampling, Dynamic programming
1 INTRODUCTION
Authorized licensed use limited to: University of Sydney. Downloaded on September 17,2023 at 23:38:39 UTC from IEEE Xplore. Restrictions apply.
© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
This article has been accepted for publication in IEEE Transactions on Knowledge and Data Engineering. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2023.3314643
graph, because the dense regions of the graph may contain many large cliques (with complicated overlap relationships), resulting in a large search tree of PIVOTER (e.g., see the results on the LiveJournal dataset in [17]).

Approximation solutions based on sampling are typically able to handle large dense graphs when k is not very large [18], [20], [23]. However, to achieve a desired accuracy, previous sampling-based solutions either require a huge number of samples [18], [24], [25] or consume very high storage space [19], [20], [23], [26], [27] for a relatively large k (e.g., k ≥ 8). Among them, a notable sampling-based approximation algorithm is the TuranShadow algorithm, which was proposed by Jain and Seshadhri [20]. As shown in [20], TuranShadow is much faster and more accurate than the other previous sampling-based algorithms. The main limitation of TuranShadow is that it needs $O(n\alpha^{k-1} + m)$ time and $O(n\alpha^{k-2})$ space to construct a data structure called the Turán Shadow for sampling, where α denotes the arboricity of the graph [12]. Therefore, on large graphs, TuranShadow is very costly for a large k. To reduce the space usage of TuranShadow, the same authors developed an improved TuranShadow algorithm called PEANUTS. PEANUTS adopts an online sampling solution which does not construct the Turán Shadow offline. However, PEANUTS still needs to build a partial Turán Shadow when estimating the k-clique counts of a sampled node, which sometimes consumes a lot of space.

To overcome the limitations of the state-of-the-art algorithms, we propose a new framework to estimate the number of k-cliques in a graph which integrates both the exact PIVOTER algorithm and three newly-developed sampling-based techniques. Our framework is based on a simple but effective observation: PIVOTER is extremely efficient at computing the number of k-cliques in the sparse regions of the graph, while sampling-based solutions are often very efficient and accurate at estimating the k-clique counts in the dense regions of a graph. Based on this crucial observation, we can first partition the graph into sparse and dense regions. Then, for the sparse regions, we invoke PIVOTER to exactly compute the k-clique counts. For the dense regions, we propose three novel sampling techniques based on a concept of graph coloring [28] to estimate the k-clique counts. Specifically, we first present a new concept called k-color set, which denotes a set of k nodes with k different colors. Then, we propose a dynamic programming (DP) based k-color set sampling algorithm to estimate the k-clique counts. Since a k-color set is typically a good approximation for a k-clique in the dense regions of a graph, our algorithm is extremely efficient and accurate. In addition, we also propose a novel DP-based k-color path sampling technique and a novel DP-based k-triangle path sampling technique to further improve the efficiency and accuracy. Here a k-color path is a connected k-color set and a k-triangle path is a k-color path with any three consecutive nodes forming a triangle. These two new concepts approximate a k-clique more effectively than the k-color set. Moreover, unlike TuranShadow and PEANUTS, all of our sampling-based solutions take near-linear space w.r.t. the graph size.

Contributions. In summary, the main contributions of this paper are as follows.

• We propose a new algorithmic framework for estimating k-clique counts which can circumvent the defects of the existing exact and approximation algorithms. We show that our framework is extremely efficient and accurate. It can achieve a $10^{-5}$ relative error by sampling a reasonable number of samples.
• We develop three novel DP-based k-color set sampling techniques to estimate the number of k-cliques in the dense regions of the graph. Our novelty is in the algorithmic use of the classic graph coloring technique for sampling. The striking features of our techniques are that they are not only very efficient and accurate, but also use near-linear space w.r.t. the graph size.
• We evaluate our algorithms on 8 large real-life graphs. The results show that (1) our best algorithm is at least one order of magnitude faster than the state-of-the-art approximate algorithm (PEANUTS) to achieve a 0.1% relative error, using much smaller space; and (2) it can be up to three orders of magnitude faster than the state-of-the-art exact algorithm (PIVOTER) on large graphs. For example, on the hardest dataset LiveJournal with k = 8, TuranShadow takes more than 120 seconds and PIVOTER cannot terminate within 5 hours, while our best algorithm consumes around 20 seconds to achieve a 0.1% relative error. Moreover, our algorithms also exhibit an excellent parallel performance, achieving 12× ∼ 14× speedup ratios when using 16 threads in our experiments.

Reproducibility. For reproducibility purposes, the source code of this paper is released at https://github.com/LightWant/dpcolor.

Organization. The rest of this paper is organized as follows. In Section 2, we describe several key notations, formulate the problem, summarize several representative existing algorithms of k-clique counting, and also analyze the defects of these algorithms. In Section 3, we propose a novel sampling framework for k-clique counting. In Section 4, we present the DP-based k-color set sampling algorithm. The k-color path and k-triangle path algorithms are developed in Section 5 and Section 6 respectively. Extensive experiments are shown in Section 8. Finally, we survey the related work in Section 9 and conclude this work in Section 10.

2 PRELIMINARIES

Let G = (V, E) be an undirected graph, where V and E denote the set of nodes and edges respectively. Let n and m be the number of nodes and edges of G respectively. Denote by $N_v(G)$ the set of neighbors of v in G. The degree of v, denoted by $d_v(G)$, is the size of the neighbor set of v, i.e., $d_v(G) = |N_v(G)|$. Given a subset S of V, we denote by $G(S) = (V_S, E_S)$ the subgraph of G induced by S, where $E_S = \{(u, v) \in E \mid u, v \in S\}$. A k-clique is a complete subgraph of G with k nodes, in which every pair of nodes is connected by an edge.

Given a graph G and an integer k, the k-clique counting problem is to compute the number of k-cliques in G. Practical algorithms for solving the k-clique counting problem
are often based on some ordering-based heuristic techniques [15], [16], [17], [20].

Let $\pi : V \to \{v_1, \ldots, v_n\}$ be a total order of the nodes in G. For two nodes u and v of G, we say that $\pi(v) < \pi(u)$ if u comes after v in the ordering of π. Then, based on such an ordering, we can obtain a DAG (directed acyclic graph) $\vec{G}$ by orienting the edges of the undirected graph G. Specifically, for each undirected edge (u, v) in G, we obtain a directed edge (u, v) in $\vec{G}$ if $\pi(u) < \pi(v)$, otherwise we get a directed edge (v, u). The k-clique counting problem in G is equivalent to computing the number of k-cliques in $\vec{G}$. Existing k-clique counting algorithms that work on the DAG $\vec{G}$ (instead of the original graph G) can guarantee that each k-clique is only explored once, thus significantly improving the efficiency.

Note that many different ordering heuristics for k-clique counting have been developed in the literature [16]. Among them, a widely-used ordering heuristic is the degeneracy ordering [21], where the degeneracy is a metric to measure the sparsity of a graph [29]. Specifically, the degeneracy ordering of nodes in G is defined as an ordering $\{v_1, \ldots, v_n\}$ such that the degree of $v_i$ is minimum in the subgraph of G induced by $\{v_i, \ldots, v_n\}$ for each $v_i$ in G. We can make use of a classic peeling algorithm to generate the degeneracy ordering in O(m + n) time [30]. Let δ be the degeneracy of G. Then, we can easily derive that $d_v(\vec{G}) \le \delta$. Since δ is often very small in real-world graphs [21], [29], the degeneracy ordering based k-clique counting algorithms are often very efficient in practice [16]. In this work, we will also use the degeneracy ordering to design our algorithms.

2.1 Existing algorithms and their limitations

The PIVOTER algorithm. PIVOTER is the state-of-the-art exact k-clique counting algorithm, which was proposed by Jain and Seshadhri [17]. The PIVOTER algorithm is based on a classic pivoting technique which has been widely used for pruning the search branches in maximal clique enumeration [22]. The key idea of PIVOTER is that it implicitly builds a succinct clique tree (SCT) by using the pivoting technique in the search procedure. Such an SCT structure maintains a unique representation of all k-cliques, but its size is often much smaller than the number of k-cliques. PIVOTER was shown to be much faster than the traditional k-clique listing based algorithms [15], [16], [17]. Since we will make use of PIVOTER as a subroutine in our algorithms, we give the detailed description of PIVOTER in Algorithm 1.

Algorithm 1: The PIVOTER Algorithm [17]
Input: A graph G = (V, E) and an integer k
Output: The number of k-cliques in G
1  $\vec{G}$ ← the DAG generated by the degeneracy ordering of G;
2  ans ← 0;
3  for u ∈ V do
4      PIVOTER($N_u(\vec{G})$, k − 1, 0, 0);
5  return ans;
6  Procedure PIVOTER(S, k, p, h)
7      if h > k return;
8      if S = ∅ then
9          ans ← ans + $\binom{p}{k-h}$;
10         return;
11     pv ← $\arg\max_{u \in S} \{|N_u(\vec{G}) \cap S|\}$;
12     PIVOTER($N_{pv}(\vec{G}) \cap S$, k, p + 1, h);
13     U ← S − $N_{pv}(\vec{G})$ − {pv};
14     for $v_i$ ∈ U do
15         PIVOTER($N_{v_i}(\vec{G}) \cap S$, k, p, h + 1);
16         S ← S − {$v_i$};

Algorithm 1 first computes a DAG $\vec{G}$ of G based on the degeneracy ordering (line 1). Then, for each node u ∈ V, the algorithm invokes the PIVOTER procedure to calculate the number of (k − 1)-cliques in $N_u(\vec{G})$ (lines 3-4). In the PIVOTER procedure, it first selects a node with the maximum number of neighbors in S as a pivot node pv (line 11). The candidate set S is then divided into three subsets: {pv}, $N_{pv}(\vec{G}) \cap S$ and $S - \{pv\} - N_{pv}(\vec{G})$. Based on these three subsets, the cliques can be classified into three types: (1) the k-cliques containing nodes in both {pv} and $N_{pv}(\vec{G}) \cap S$, (2) the k-cliques only containing nodes in $N_{pv}(\vec{G}) \cap S$, and (3) the k-cliques containing nodes in $S - \{pv\} - N_{pv}(\vec{G})$. Then, PIVOTER recursively computes the total numbers for these three types of k-cliques (lines 12-16). Note that the first two types of k-cliques can be counted by invoking PIVOTER with the input set $N_{pv}(\vec{G}) \cap S$ (line 12), whereas the last type of k-cliques is iteratively counted for each node in $S - \{pv\} - N_{pv}(\vec{G})$ (lines 14-16). The worst-case time complexity of PIVOTER is $O(n\alpha 3^{\alpha/3})$, where α is the arboricity [12] of the graph and δ/2 ≤ α ≤ δ. Since α is often very small in real-life sparse graphs, the PIVOTER algorithm was shown to be very efficient in practice [17].

The TuranShadow algorithm and its variant. TuranShadow is a representative sampling-based approximation algorithm which was also proposed by Jain and Seshadhri [20]. As shown in [20], TuranShadow is much faster and more accurate than the other sampling-based algorithms. The TuranShadow algorithm first constructs a data structure, called the Turán Shadow, based on the classic Turán's theorem, which states that a graph must contain a k-clique if the edge density $\rho = m/\binom{n}{2}$ satisfies $\rho > 1 - 1/(k-1)$. Specifically, the Turán Shadow, denoted by $\mathcal{S}$, contains a set of pairs (S, l) where S is a node set and l ≤ k is an integer. Let $G_S$ be the subgraph induced by the node set S. For each pair (S, l), the edge density of $G_S$ is larger than $1 - 1/(l-1)$, thus $G_S$ must contain an l-clique by Turán's theorem. Jain and Seshadhri [20] showed that there is a one-to-one mapping between a k-clique in G and an l-clique in $G_S$ for a pair (S, l) in $\mathcal{S}$. Therefore, to count the number of k-cliques, it is sufficient to calculate the number of l-cliques in $G_S$ for each pair (S, l), which can be efficiently estimated by a weighted sampling procedure [20]. In [20], Jain and Seshadhri also developed an algorithm with $O(\alpha|\mathcal{S}| + m)$ time complexity to construct the Turán Shadow, where α is the arboricity of the graph and $|\mathcal{S}| = O(n\alpha^{k-2})$. Since α is typically very small in real-life graphs, TuranShadow is efficient at estimating the k-clique counts. Recently, Jain and Seshadhri proposed an improved Turán Shadow algorithm, namely PEANUTS [27], which can be considered as the state-of-the-art sampling-based approximation algorithm. PEANUTS does not construct the Turán Shadow offline. Instead, it
builds a partial Turán Shadow for a sampled node during the sampling procedure, thus it uses much less space than the original TuranShadow algorithm. Moreover, PEANUTS is often much faster than TuranShadow, since it does not need to construct the whole Turán Shadow, which takes much time in the original TuranShadow algorithm.

Limitations of the state-of-the-art algorithms. Although the PIVOTER algorithm is often very efficient for handling real-life sparse graphs (because real-life graphs often have a small arboricity), it is still intractable when processing some hard instances, such as the LiveJournal graph in [17]. The reason may be that such hard instances often have a huge number of maximal cliques, thus the succinct clique tree (SCT) of the PIVOTER algorithm can be very large, rendering the algorithm intractable. TuranShadow is generally faster than the exact PIVOTER algorithm for handling dense graphs with a provably small relative error. However, the main limitations of TuranShadow are twofold: (1) it uses $O(n\alpha^{k-2})$ space to store the Turán Shadow, which is very costly for large graphs; and (2) it often needs much time to construct the Turán Shadow for large graphs (the construction time is $O(n\alpha^{k-1} + m)$). These two limitations are alleviated by the improved Turán Shadow algorithm PEANUTS. However, on some large graphs, PEANUTS needs much time to construct the partial Turán Shadow and uses considerable space, thus it is still not very efficient when processing large graphs.

3 THE PROPOSED FRAMEWORK

In this section, we propose a new algorithmic framework to estimate the number of k-cliques which combines both the exact PIVOTER algorithm and the sampling-based algorithms. The key idea of our framework is based on a simple but effective observation. The PIVOTER algorithm often works very efficiently in the sparse regions of the graph, in which the number of k-cliques is typically not very large. However, in the dense regions of the graph, PIVOTER may be very costly to compute the k-clique counts, as the dense regions of the graph may contain a huge number of k-cliques. On the contrary, the sampling-based solutions are often very efficient and accurate at estimating the number of k-cliques in the dense regions of the graph, but they generally perform very badly in the sparse regions of the graph. This is because the k-cliques are relatively easy to sample in the dense regions, but they are often very hard to draw from the sparse regions of the graph. Therefore, to overcome the limitations of both the exact and sampling algorithms, we can apply the exact PIVOTER algorithm to calculate the k-clique counts in the sparse regions of the graph, and use the sampling-based techniques to estimate the number of k-cliques in the remaining dense regions of the graph. The details of our framework are shown in Algorithm 2.

Algorithm 2: The Proposed Framework
Input: A graph G = (V, E), an integer k, and the sample size t
Output: The number of k-cliques in G
1  $\vec{G}$ ← the DAG generated by the degeneracy ordering of G;
2  ans ← 0; S ← ∅;
3  foreach v ∈ V do
4      if $\bar{d}(G(N_v(\vec{G}))) < k$ then ans ← ans + PIVOTER($N_v(\vec{G})$, k − 1);
5      else S ← S ∪ {v};
6  return ans + Sampling($\vec{G}$, S, k, t);

Note that in Algorithm 2, we make use of the average degree of the nodes in the subgraph $C = (V_C, E_C)$ of G, denoted by $\bar{d}(V_C) = \sum_{v \in V_C} d_v(C)/|V_C|$, as an indicator to measure the sparsity of C. We refer to a subgraph C of G as a dense subgraph of G if $\bar{d}(V_C) \ge k$ (i.e., it lies in the dense regions of G), otherwise it is called a sparse subgraph. Algorithm 2 first computes a DAG $\vec{G}$ by the degeneracy ordering of G (line 1). Let $N_v(\vec{G})$ be the out-neighbors of a node v in $\vec{G}$, and $G(N_v(\vec{G}))$ be the subgraph induced by $N_v(\vec{G})$ in G. If the average degree of $G(N_v(\vec{G}))$ is smaller than k, the algorithm invokes PIVOTER to exactly compute the number of (k − 1)-cliques contained in $N_v(\vec{G})$ (line 4). Otherwise, the subgraph $G(N_v(\vec{G}))$ is considered as a dense region of G, and the (k − 1)-cliques contained in $N_v(\vec{G})$ are estimated by a sampling algorithm (lines 5-6).

Let α be the arboricity [12] of the input graph G = (V, E) and $V'$ be the set of nodes in the sparse region of G. Then, we have the following result.

Theorem 1. The time complexity of Line 4 of Algorithm 2 is $O\left(|V'|\,\alpha\,\lceil\sqrt{k\alpha} + \tfrac{1}{2}\rceil\,3^{\lceil\sqrt{k\alpha} + \frac{1}{2}\rceil/3}\right)$.

Proof. For a node v in $V'$, we use $\alpha_v$ to denote the arboricity of $G(N_v(\vec{G}))$, $m_v$ to denote the number of edges in $G(N_v(\vec{G}))$, and $n_v$ to denote the number of nodes $|N_v(\vec{G})|$. It is easy to derive that $n_v \le \delta \le 2\alpha$ according to the degeneracy ordering. By $\bar{d}(G(N_v(\vec{G}))) < k$ (Line 4 of Algorithm 2), we can derive that $m_v = |N_v(\vec{G})| \times \bar{d}(G(N_v(\vec{G}))) < k|N_v(\vec{G})| \le 2k\alpha$. Then, we have $\alpha_v \le \lceil \sqrt{2m_v + n_v}/2 \rceil \le \lceil \sqrt{k\alpha} + \tfrac{1}{2} \rceil$ [12]. Thus, the total time complexity is $O\left(\sum_{v \in V'} |N_v(\vec{G})|\,\alpha_v\,3^{\alpha_v/3}\right)$, because the time complexity of PIVOTER is $O(n\alpha 3^{\alpha/3})$ for a graph with n nodes and arboricity α [17]. As a result, we can derive that the time complexity of Line 4 of Algorithm 2 is $O\left(|V'|\,\alpha\,\lceil\sqrt{k\alpha} + \tfrac{1}{2}\rceil\,3^{\lceil\sqrt{k\alpha} + \frac{1}{2}\rceil/3}\right)$.

Note that Theorem 1 shows the time complexity of Algorithm 2 in the sparse region of the graph. Since $\lceil\sqrt{k\alpha} + \tfrac{1}{2}\rceil$ is smaller than α (because k is usually very small), our framework is efficient on the sparse region of the graph.

The remaining question is how we can devise an efficient and effective sampling algorithm to estimate the number of k-cliques in the dense regions of G. Traditional edge sampling algorithms, such as [18], [31], are often inefficient, because those algorithms require a considerable number of samples to achieve a desired accuracy [20]. The color-coding based techniques often consume a significant amount of space [19], [23], [26] and are also less efficient than the TuranShadow algorithm [20]. The TuranShadow algorithm and its variant [20], [27], which are the state-of-the-art sampling-based techniques, also need much space to store the (partial) Turán Shadow. Moreover, the construction
time of the (partial) Turán Shadow is often very long for large graphs, because the worst-case time complexity of TuranShadow is exponential. In Sections 4, 5 and 6, we will propose three novel and efficient sampling algorithms to tackle this problem.

Fig. 1. Illustration of the three proposed color-based sampling techniques: (a) an example graph; (b) the DP table of the k-color sets; (c) all 3-color paths; (d) three 4-color paths and two 4-triangle paths.

Parallel implementation. Note that the proposed framework (Algorithm 2) can be easily parallelized, because the number of k-cliques in the subgraph induced by the out-neighbors of each node in $\vec{G}$ is independent. Specifically, in lines 3-5 of Algorithm 2, we can process the nodes in the sparse regions in parallel by independently invoking the PIVOTER algorithm. In the dense regions, the sampling-based techniques are also easy to parallelize, because we can always draw t independent samples in parallel. In our experiments, we will show that our parallel implementations can achieve a near-linear speedup ratio on real-life graphs.

4 K-COLOR SET SAMPLING

In this section, we develop a novel sampling approach to estimate the k-clique counts in the dense regions of the graph, called k-color set sampling. Our technique is based on a concept of graph coloring [28], [32], [33]. Specifically, we first color the nodes in a graph such that each pair of adjacent nodes is colored with different colors. Let χ be the number of colors that are used to color all nodes in the graph G. The graph coloring procedure assigns an integer color value taken from [1, ..., χ] to each node in G, and no two adjacent nodes have the same color value. Note that since the minimum coloring problem (χ is minimum) is NP-hard [28], we use a linear-time greedy coloring algorithm [32], [33] to obtain a feasible coloring solution. Based on a feasible coloring solution, we define a concept called k-color set as follows.

Definition 1. A set of nodes $V_k$ in the colored graph G is called a k-color set if it contains k nodes with k different colors.

Note that by Definition 1, the nodes of any k-clique must form a k-color set. In particular, we have the following lemma.

Lemma 1. Given a graph G, all k-cliques must be contained in the set of all k-color sets.

Let $cnt_k(G, clique)$ and $cnt_k(G, color)$ be the number of k-cliques and k-color sets of G respectively. Denote by $\rho_c$ the k-clique density of a graph G, which is defined as the ratio between the number of k-cliques and the number of k-color sets of G, i.e., $\rho_c = cnt_k(G, clique)/cnt_k(G, color)$. Intuitively, in the dense regions of the graph G, a k-color set is likely to be a k-clique. Therefore, the k-clique density $\rho_c$ of the dense region of G is often not very small. As a consequence, an effective sampling technique to estimate the number of k-cliques can be obtained by estimating $\rho_c$.

There are two nontrivial problems that need to be tackled to develop such a sampling technique. First, we need to devise an efficient algorithm to compute the number of k-color sets. Second, to estimate $\rho_c$, we need to develop a uniform sampling mechanism to sample the k-color sets. Below, we will propose a dynamic programming algorithm to solve these issues.

4.1 DP-based k-color set sampling

Here we first propose a DP algorithm to compute the number of k-color sets. Then, we show how to use the DP algorithm to uniformly sample a k-color set.

Counting the number of k-color sets. Let χ be the number of colors of the graph G obtained by the greedy coloring algorithm [32], [33]. Denote by $a_i$ the number of nodes in G with the color i ∈ [1, χ]. Let $G_i$ be the subgraph of G that only contains the nodes of G with color values no larger than i, i.e., $G_i = (V_i, E_i)$, where $V_i = \{v \in V \mid c(v) \le i\}$, $E_i = \{(u, v) \in E \mid u, v \in V_i\}$, and c(v) is the color value of v in G. Let F(i, j) be the number of j-color sets in $G_i$. Then, we have the following recursive function for all i, j ∈ [1, χ].

$$F(i, j) = a_i \times F(i-1, j-1) + F(i-1, j). \qquad (1)$$

The key idea of Eq. (1) is that the number of j-color sets in $G_i$ can be derived by considering two cases: (1) the color i is included in the j-color sets; and (2) the color i is not included in the j-color sets. For the first case, the number of j-color sets in $G_i$ is equal to $a_i$ times the number of (j − 1)-color sets in $G_{i-1}$, i.e., $a_i \times F(i-1, j-1)$. For the second case, the number of j-color sets is equal to the number of j-color sets in $G_{i-1}$, which is F(i − 1, j). Thus, the total number of j-color sets in $G_i$ is the sum over these two cases. Clearly, the number of k-color sets in G is equal to the number of k-color sets in $G_\chi$, i.e., F(χ, k). In addition, the initial states of F(i, j) are as follows:

$$F(i, 0) = 1 \text{ for all } i \in [0, \chi], \qquad F(i, j) = 0 \text{ for all } i \in [0, \chi],\ j \in [i+1, \chi]. \qquad (2)$$

Based on Eqs. (1) and (2), we can compute the number of k-color sets F(χ, k) in O(kχ) time by dynamic programming. The detailed implementation of the DP algorithm can
Algorithm 3: DPSampler (G, χ, k ) Clearly, the probability that does not choose the color i
Input: A colored graph G = (V, E), an integer k , in Gi is 1 − p(i,j) = F (i − 1, j)/F (i, j). Based on Eq. (3),
and the maximum color number χ we can sample a j -color class using the following recursive
Output: A uniformly sampled k -color set sampling procedure. In each recursion, we pick a color i
1 F ← DPCount(G, χ, k); in Gi with the probability p(i,j) . If the color i is sampled,
we recursively sample the (j − 1)-color class in Gi−1 .
2 p(i,j) ← ai ×FF(i−1,j−1) for all i ∈ [1, χ] and j ∈ [1, k];
(i,j) Otherwise, we recursively sample the j -color class in Gi−1 .
3 R ← DPSampling(G, P, ∅, χ, k); After obtaining a k -color class, a k -color set is generated by
4 return R; randomly selecting a node with each color i in the k -color
5 Procedure DPCount(G, χ, k)
6     Let $a_i$ be the number of nodes with color i in G;
7     F(i, j) ← 0 for all i ∈ [0, χ] and j ∈ [i + 1, k];
8     foreach i = 0 to χ do F(i, 0) = 1;
9     foreach i = 1 to χ do
10        for j = 1 to k do
11            F(i, j) = $a_i$ × F(i − 1, j − 1) + F(i − 1, j);
12    return F;
13 Procedure DPSampling(G, P, R, i, j)
14    if j = 0 then return R;
15    Sample the color i with probability $p_{(i,j)}$;
16    if the color i is sampled then
17        Randomly choose a node v in G with color i;
18        DPSampling(G, P, R ∪ {v}, i − 1, j − 1);
19    else DPSampling(G, P, R, i − 1, j);

be found in the DPCount procedure of Algorithm 3 (see lines 5-12).

From counting to uniformly sampling. Here we propose an efficient approach to uniformly sample a k-color set based on the k-color set counting technique. For convenience, we refer to a set of k different colors selected from [1, χ] as a k-color class. Clearly, in a graph G, a k-color class may contain a set of k-color sets.

To generate a uniform k-color set, a potential method is to first sample a k-color class, and then randomly select a node in G with color i for each i in the sampled k-color class. The challenge of this method is how to sample the k-color class so that the resulting k-color set is uniformly generated. Obviously, the straightforward method that uniformly picks k different colors from [1, χ] is incorrect in our case. This is because the numbers of k-color sets contained in different k-color classes are different. Thus, uniformly sampling a k-color class from [1, χ] will introduce bias when generating a uniform k-color set.

To overcome this challenge, we propose a DP algorithm

class. The detailed implementation of our algorithm for uniformly sampling a k-color set is shown in Algorithm 3. Algorithm 3 first invokes the DP procedure to compute F(i, j) for every i ∈ [1, χ] and j ∈ [1, k] (line 1 and lines 5-12). Then, the algorithm computes the probability $p_{(i,j)}$ based on Eq. (3) (line 2). After that, the algorithm calls the recursive sampling procedure to uniformly generate a k-color set (line 3 and lines 13-19). The following results ensure the correctness of Algorithm 3.

Lemma 2. The DPSampling procedure in Algorithm 3 outputs a k-color set of G if χ ≥ k.

Proof. On the one hand, it is easy to verify that at most k colors are output by the DPSampling procedure, since $p_{(i,0)} = 0$ and DPSampling terminates immediately when j = 0. On the other hand, by Eq. (3), we can derive that $p_{(i,i)} = 1$. This is because F(i − 1, i) = 0 by definition, thus $1 - p_{(i,i)} = F(i-1, i)/F(i, i) = 0$. As a result, the probability $p_{(i,i)}$ of sampling a color i is always 1, thus at least k colors are sampled by DPSampling if χ ≥ k. Putting it all together, the lemma is established.

Theorem 2. Algorithm 3 outputs a uniform k-color set.

Proof. Let X be the event of a random k-color class of G sampled by DPSampling. For each color j from 1 to χ, let $Y_j$ be an indicator random variable, which is equal to 1 if the color j is selected in the event X, otherwise it is equal to 0. Let Pr(X) be the occurrence probability of the event X. Then, we have the following equation:

$$\Pr(X) = \Pr\Big(\Big(\sum_{i=1}^{\chi} Y_i\Big) = k\Big). \tag{4}$$
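The DPCount and DPSampling procedures of Algorithm 3 can be sketched in Python as below. This is a minimal illustration rather than the authors' implementation: the names `dp_count`, `dp_sample` and `groups` are hypothetical, the recursion of DPSampling is unrolled into a loop, and the sampling probability is assumed to be p(i,j) = a_i × F(i − 1, j − 1)/F(i, j), which is the form implied by the relation 1 − p(i,i) = F(i − 1, i)/F(i, i) used in the proof of Lemma 2.

```python
import random

def dp_count(color_sizes, k):
    """Return table F where F[i][j] counts the j-color sets using colors 1..i.

    color_sizes[i - 1] is a_i, the number of nodes with color i.
    """
    chi = len(color_sizes)
    F = [[0] * (k + 1) for _ in range(chi + 1)]
    for i in range(chi + 1):
        F[i][0] = 1  # exactly one way to pick nothing
    for i in range(1, chi + 1):
        a_i = color_sizes[i - 1]
        for j in range(1, k + 1):
            # either color i contributes one of its a_i nodes, or it is skipped
            F[i][j] = a_i * F[i - 1][j - 1] + F[i - 1][j]
    return F

def dp_sample(groups, F, k, rng=random):
    """Draw one k-color set; groups[i - 1] lists the nodes with color i."""
    i, j = len(groups), k
    if F[i][j] == 0:
        raise ValueError("no k-color set exists (this needs chi >= k)")
    picked = []
    while j > 0:
        a_i = len(groups[i - 1])
        # assumed form of p(i,j): fraction of the j-color sets over
        # colors 1..i that contain a node of color i
        p = a_i * F[i - 1][j - 1] / F[i][j]
        if rng.random() < p:
            picked.append(rng.choice(groups[i - 1]))
            j -= 1
        i -= 1
    return picked

# Coloring of Example 1: nodes 0..6 have colors 1, 2, 2, 3, 4, 3, 4.
groups = [[0], [1, 2], [3, 5], [4, 6]]
F = dp_count([len(g) for g in groups], 3)
print(F[4][3])  # 20 three-color sets in total
print(dp_sample(groups, F, 3))
```

Because a color is skipped with probability F(i − 1, j)/F(i, j), the loop never reaches a state with F(i, j) = 0, so it always stops with exactly k nodes of k distinct colors.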
Authorized licensed use limited to: University of Sydney. Downloaded on September 17,2023 at 23:38:39 UTC from IEEE Xplore. Restrictions apply.
© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.
This article has been accepted for publication in IEEE Transactions on Knowledge and Data Engineering. This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2023.3314643
The following theorem shows the complexity of Algorithm 3.

Theorem 3. Suppose that the graph G is colored and the nodes in each color group are obtained. Then, both the time and space complexity of Algorithm 3 are O(χk).

Proof. Clearly, the time complexity of the DP procedure for counting the number of k-color sets is O(χk). In the DPSampling procedure, we can randomly choose a node with color i in constant time if the color groups are obtained (line 17). The total time cost of the DPSampling procedure is bounded by O(χ + k). As a result, the time complexity of Algorithm 3 is O(χk). For the space complexity, Algorithm 3 only requires O(χk) additional space to store the DP table F and the probabilities p.

Example 1. Fig. 1(a) is a colored graph with χ = 4. The color values of nodes {0, 1, 2, 3, 4, 5, 6} are {1, 2, 2, 3, 4, 3, 4}, respectively. Clearly, we have $a_1 = 1$ and $a_i = 2$ for i = 2, 3, 4. Initially, we have F(i, 0) = 1 for all i ∈ [0, 4], and F(i, j) = 0 for all i ∈ [0, 4], j ∈ [i + 1, 4]. By Eq. (1), we have $F(1, 1) = a_1 \times F(0, 0) + F(0, 1) = a_1 = 1$, which means that there is only one way to choose a vertex with color 1. Similarly, we get $F(2, 1) = a_2 \times F(1, 0) + F(1, 1) = 3$, which means that there are 3 different ways to choose a vertex from the vertices with colors 1 and 2. The DP table is shown in Fig. 1(b). Then, we analyze the probability of sampling the three nodes {0, 2, 3} with colors 1, 2, 3, respectively. Note that the probability of color 4 not being sampled is $1 - p_{(4,3)} = \frac{F(3,3)}{F(4,3)} = \frac{1}{5}$. Then, the probability of color 3 being sampled is $p_{(3,3)} = \frac{a_3 \times F(2,2)}{F(3,3)} = 1$. Thus, one vertex with color 3 should be chosen and the probability of node 3 being sampled is $\frac{1}{2}$. Likewise, the probabilities of nodes 2 and 0 being sampled are $\frac{1}{2}$ and 1, respectively. Finally, the probability of {0, 2, 3} being sampled is $\frac{1}{20}$.

4.2 Estimating the number of k-cliques

By Theorem 2, we can first make use of Algorithm 3 to uniformly sample k-color sets from G, and then estimate the clique density $\rho_c$ in the k-color sets of G. After that, the number of k-cliques in G can be estimated by $\rho_c \times F(\chi, k)$. Based on this idea, we propose a weighted sampling algorithm to estimate the number of cliques in the dense regions of G. The detailed implementation of our algorithm is shown in Algorithm 4.

Let S be a set of nodes whose neighborhood subgraphs are dense regions of G, i.e., $\bar{d}(G(N_v(\vec{G}))) \ge k$ for each v ∈ S. Algorithm 4 first colors the graph using a linear-time greedy algorithm [32], [33] (line 1). Then, the algorithm invokes the DPCount procedure to compute the number of k-color sets for each v ∈ S (lines 3-4). Let cntKCol be the total number of k-color sets (line 5). Then, we can obtain a probability distribution D over S where $p(v) = F_v(\chi, k-1)/\mathrm{cntKCol}$ for each v ∈ S (line 6). After that, Algorithm 4 draws t k-color sets by (1) sampling a node v ∈ S with probability p(v) (line 9), and (2) uniformly sampling a (k − 1)-color set from $G(N_v(\vec{G}))$ (line 10). The algorithm computes the k-clique density $\rho_c$ in the sampled k-color sets (lines 11-13), and then estimates the k-clique count as $\rho_c \times \mathrm{cntKCol}$ (line 14). The following theorem shows that Algorithm 4 can obtain an unbiased estimator.

Theorem 4. Algorithm 4 outputs an unbiased estimator for the number of k-cliques in the dense regions of G.

Proof. Let $X_i = 1$ if the ith sampled k-color set is a k-clique, otherwise $X_i = 0$. Observe that

$$\Pr(X_i = 1) = \sum_{v \in S} \big[\Pr(\text{choose } v \text{ from } D) \times \Pr(\text{choose a clique from } G(N_v(\vec{G})))\big]. \tag{7}$$
In the summation, the former probability is $\frac{F_v(\chi, k-1)}{\sum_{u \in S} \mathrm{cnt}_k(G(N_u(\vec{G})), \mathrm{color})}$, and the latter is exactly $\frac{\mathrm{cnt}_k(G(N_v(\vec{G})), \mathrm{clique})}{F_v(\chi, k-1)}$. Consequently, we have $\Pr(X_i = 1) = \frac{\sum_{v \in S} \mathrm{cnt}_k(G(N_v(\vec{G})), \mathrm{clique})}{\sum_{v \in S} \mathrm{cnt}_k(G(N_v(\vec{G})), \mathrm{color})}$. This implies that the probability of sampling a k-clique is exactly the k-clique density in the dense regions. By the linearity of expectation, we have

$$\begin{aligned} E\Big[\mathrm{cntKCol} \times \frac{\sum_{i \le t} X_i}{t}\Big] &= \sum_{v \in S} \mathrm{cnt}_k(G(N_v(\vec{G})), \mathrm{color}) \times \frac{\sum_{i \le t} E[X_i]}{t} \\ &= \sum_{v \in S} \mathrm{cnt}_k(G(N_v(\vec{G})), \mathrm{clique}). \end{aligned} \tag{8}$$

Therefore, Algorithm 4 returns an unbiased estimator of the k-clique count in the dense regions of G.

By applying the classic Chernoff bound, we can easily derive that Algorithm 4 is able to produce a (1 − ε)-approximation of the k-clique count in the dense regions of the graph.

Theorem 5. Algorithm 4 returns a (1 − ε)-approximation of the number of k-cliques in the dense regions of G with probability 1 − 2σ if $t \ge \frac{3}{\rho_c \epsilon^2} \ln \frac{1}{\sigma}$, where ε and σ are small positive values and t is the sample size.

Proof. Denote by $\hat{\rho}_c$ the estimator of the k-clique density (line 13 of Algorithm 4). Since our estimator is unbiased, we have $E[\hat{\rho}_c] = \rho_c$. Then, the expected number of k-cliques in the t samples is $E[\hat{\rho}_c t] = \rho_c t$. Based on the Chernoff bound, we easily obtain the following results:

$$\Pr(\hat{\rho}_c t \le (1 - \epsilon)\rho_c t) \le \exp\Big(-\frac{\epsilon^2 \rho_c t}{2}\Big) \le \exp\Big(-\frac{\epsilon^2 \rho_c t}{3}\Big), \tag{9}$$

$$\Pr(\hat{\rho}_c t \ge (1 + \epsilon)\rho_c t) \le \exp\Big(-\frac{\epsilon^2 \rho_c t}{3}\Big). \tag{10}$$

Further, we have:

$$\Pr\Big(\frac{|\hat{\rho}_c - \rho_c|}{\rho_c} \ge \epsilon\Big) \le 2\exp\Big(-\frac{\epsilon^2 \rho_c t}{3}\Big). \tag{11}$$

Let $\exp(-\frac{\epsilon^2 \rho_c t}{3}) \le \sigma$. Then, we can derive that $t \ge \frac{3}{\rho_c \epsilon^2} \ln \frac{1}{\sigma}$. This completes the proof.

Note that by Theorem 5, the sample size of our algorithm relies on the k-clique density $\rho_c$. Since $\rho_c$ is often not very small in the dense regions of a graph, Algorithm 4 is expected to be very efficient in practice, which is also confirmed in our experiments. Below, we analyze the time and space complexity of Algorithm 4.

Theorem 6. Algorithm 4 consumes $O((|S| + t)\chi k + k^2 t + m + n)$ time and $O(m + n + \chi k)$ space.

Proof. For the time complexity, Algorithm 4 takes O(m + n) time to obtain a feasible graph coloring. Then, it consumes O(|S|χk) time to compute $F_v$ for each v ∈ S. After that, to draw a k-color set, the algorithm takes O(χk) time, plus $O(k^2)$ time to check whether it is a clique. Thus, the total time used in the k-color set sampling stage is $O(t(\chi k + k^2))$. As a consequence, the time complexity of Algorithm 4 is $O((|S| + t)\chi k + m + n + k^2 t)$. For the space complexity, the algorithm needs to store the graph G and the colors, which takes O(m + n) space in total. Additionally, the algorithm uses O(χk) space to store the DP table when sampling a k-color set. Note that the algorithm does not store all the DP tables for all samples. Thus, the total space overhead of Algorithm 4 is O(m + n + χk).

Remark. The proposed k-color set sampling algorithm is completely different from the traditional color coding technique [19], [23], [26] for k-clique counting. The color coding technique randomly assigns a color to each node (it is actually not a valid graph coloring), in which two adjacent nodes may have the same color. However, our k-color set based sampling algorithm is based on the graph coloring technique which requires two adjacent nodes to have different colors. For the color coding technique, the probability of each k-clique being colored with k different colors is $\frac{k!}{k^k}$ [19]. With the increase of k, such a probability decreases dramatically. However, our technique can ensure that the k-clique of G is a k-color set no matter what k is. Moreover, unlike color coding, the probability of sampling k nodes with k different colors from G (the colored graph) is nonuniform in our algorithm.

5 CONNECTED k-COLOR SET SAMPLING

Recall that to achieve a (1 − ε)-approximation, the sample size of Algorithm 4 heavily relies on the k-clique density over the k-color sets, i.e., $\rho_c$ (see Theorem 5). Although the dense regions of a graph G often have a relatively high $\rho_c$, it may still be very small in some cases as the k-color sets do not fully capture the clique property. To improve the effectiveness of the sampling algorithm, we propose a novel technique which can further boost the k-clique density by considering the connectivity of the k-color set.

A k-color set is definitely not a k-clique if the subgraph induced by the k-color set is not connected. Clearly, such disconnected k-color sets are unpromising samples for our sampling algorithm. Therefore, to improve the sampling performance, a natural question is whether we can directly sample the connected k-color sets from G. In this section, we answer this question affirmatively by devising a novel k-color path sampling technique. The insight is that we only sample the k-color set in which there exists a simple path with length k − 1 in the subgraph induced by the k-color set. For convenience, we refer to such a connected k-color set as a k-color path.

Similar to sampling k-color sets in G, we also need to uniformly sample the k-color paths. Unfortunately, the solutions proposed in Section 4 are no longer applicable for sampling k-color paths. Below, we develop a new DP-based sampling technique to uniformly generate the k-color paths.

5.1 DP-based k-color path sampling

Counting the number of k-color paths. We start by developing an algorithm to count the number of k-color paths in a graph G. We assume that the graph G is colored with the color values selected from [1, χ]. Based on the color values,
we can obtain a color ordering by sorting the nodes in a non-decreasing ordering of their color values. Note that we can use the node IDs to break ties to obtain a total ordering. It is worth mentioning that such a color ordering was used in the k-clique listing algorithms [16]. Clearly, we are able to construct a DAG $\vec{G}$ by the color ordering, where a directed edge (u, v) ∈ $\vec{G}$ is obtained by orienting the direction of (u, v) ∈ G if v comes after u in the color ordering. Based on the DAG $\vec{G}$, we can obtain the following results.

Theorem 7. Let $\vec{G}$ be the DAG generated by the color ordering. Then, any (k − 1)-path in $\vec{G}$ forms a k-color path.

Proof. Let $P = \{(v_1, v_2), (v_2, v_3), \cdots, (v_{k-1}, v_k)\}$ be a (k − 1)-path in $\vec{G}$. By the color ordering, we have $c(v_i) \le c(v_{i+1})$ for every i ∈ [1, k − 1], where $c(v_i)$ denotes the color value of $v_i$. Since any two adjacent nodes have different colors, we have $c(v_i) \ne c(v_{i+1})$ for each i ∈ [1, k − 1]. As a result, the colors along P are strictly increasing, and the path P is a k-color path.

Theorem 8. Let $\vec{G}$ be the DAG generated by the color ordering. Then, any k-clique $C = \{v_1, v_2, \cdots, v_k\}$ in G is a k-color path in $\vec{G}$.

Proof. Let $C = \{v_1, v_2, \cdots, v_k\}$ be a k-clique in G. Clearly, the nodes in C have different colors. Suppose without loss of generality that $c(v_1) < c(v_2) < \cdots < c(v_k)$. Since $\vec{G}$ is generated by the color ordering, there must exist a path $\{(v_1, v_2), \cdots, (v_{k-1}, v_k)\}$ in $\vec{G}$ which also forms a valid k-color path.

Algorithm 5: DPPathSampler(G, k)
Input: A colored graph G = (V, E), and an integer k
Output: A uniformly sampled k-color path
1 $\vec{G}$ ← the DAG generated by the color ordering of G;
2 H ← DPPathCount($\vec{G}$, k);
3 R ← DPPathSampling($\vec{G}$, H, k);
4 return R;
5 Procedure DPPathCount($\vec{G}$, k)
6     $H(v_i, j)$ ← 0, for i ∈ [1, n] and j ∈ [1, k − 1];
7     foreach i = 1 to n do $H(v_i, 0)$ = 1;
8     foreach j = 1 to k − 1 do
9         for i = 1 to n do
10            for $v_x \in N_{v_i}(\vec{G})$ do
11                $H(v_i, j)$ ← $H(v_i, j) + H(v_x, j - 1)$;
12    return H;
13 Procedure DPPathSampling($\vec{G}$, H, k)
14    R ← ∅; Q ← V;
15    for i = 0 to k − 1 do
16        cnt ← $\sum_{u \in Q} H(u, k - i - 1)$;
17        Set the probability distribution D over the nodes in Q where p(u) = H(u, k − i − 1)/cnt for each u ∈ Q;
18        Sample a node u from D;
19        R ← R ∪ {u}; Q ← $N_u(\vec{G})$;
20    return R;
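As an illustration of Algorithm 5, the following Python sketch builds the color-ordered DAG, fills the table H, and draws one k-color path. It is only a sketch under stated assumptions: the function names, the 0-based node IDs and the adjacency-list representation are hypothetical choices, H(v, j) is taken to be the number of directed paths with j edges that start at v, and `random.choices` performs the weighted draw of lines 16-18.

```python
import random

def build_dag(n, edges, color):
    """Orient every edge from the earlier to the later node in the color ordering."""
    def key(v):
        return (color[v], v)  # ties on color are broken by node ID
    out = [[] for _ in range(n)]
    for u, v in edges:
        if key(u) < key(v):
            out[u].append(v)
        else:
            out[v].append(u)
    return out

def dp_path_count(out, k):
    """H[v][j] = number of directed paths with j edges starting at v (j < k)."""
    n = len(out)
    H = [[0] * k for _ in range(n)]
    for v in range(n):
        H[v][0] = 1
    for j in range(1, k):
        for v in range(n):
            H[v][j] = sum(H[x][j - 1] for x in out[v])
    return H

def dp_path_sample(out, H, k, rng=random):
    """Draw one k-node path, or return None if the DAG has no such path."""
    path, candidates = [], list(range(len(out)))
    for i in range(k):
        weights = [H[u][k - i - 1] for u in candidates]
        if sum(weights) == 0:
            return None
        u = rng.choices(candidates, weights=weights, k=1)[0]
        path.append(u)
        candidates = out[u]  # the next node must be an out-neighbor of u
    return path

# Hypothetical toy graph: a 4-clique on nodes 0..3 plus node 4 attached to node 3.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
color = [1, 2, 3, 4, 1]
out = build_dag(5, edges, color)
H = dp_path_count(out, 4)
print(sum(H[v][3] for v in range(5)))  # 1: the only 4-color path is 0, 1, 2, 3
print(dp_path_sample(out, H, 4))       # [0, 1, 2, 3]
```

Each node is drawn with weight equal to the number of remaining paths that start from it, so the product of the step probabilities is the same for every path.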
Note that a k-color path in $\vec{G}$ does not necessarily form a k-clique in G. However, the set of k-color paths is clearly a subset of the set of k-color sets. Thus, the k-clique density over the k-color paths, denoted by $\rho_p$, must be no smaller than the k-clique density over the k-color sets.

Example 2. Reconsider the graph shown in Fig. 1(a). Clearly, we have c(0) < c(1) = c(2) < c(3) = c(5) < c(4) = c(6). Fig. 1(c) plots all the 3-color paths, and Fig. 1(d) shows all the 4-color paths. The paths with dashed circles are not cliques, while the others are cliques. We can also easily derive that the 3-clique density is $\frac{6}{10}$ and the 4-clique density is $\frac{1}{3}$. As expected, the count of k-color paths is much smaller than the count of k-color sets.

To estimate the number of k-cliques in G, we need to compute $\rho_p$ and the number of k-color paths as well. Let $\vec{G}_{v_i}$ be a subgraph of $\vec{G}$ induced by $\{v_i, ..., v_n\}$. Denote by $H(v_i, j)$ the number of j-paths containing the node $v_i$ in $\vec{G}_{v_i}$. Clearly, each j-path containing $v_i$ in $\vec{G}_{v_i}$ must start from $v_i$, since the node $v_i$ in $\vec{G}_{v_i}$ only has out-neighbors. Thus, the total number of (k − 1)-paths of $\vec{G}$, denoted by $\mathrm{cnt}_{k-1}(\vec{G}, \mathrm{path})$, can be computed by the following formula:

$$\mathrm{cnt}_{k-1}(\vec{G}, \mathrm{path}) = \sum_{v_i \in \vec{G}} H(v_i, k-1). \tag{12}$$

Observe that the second node in each (k − 1)-path containing $v_i$ in $\vec{G}_{v_i}$ must be an out-neighbor of $v_i$. Thus, if we have the count of the (j − 2)-paths containing $v_x$ in $\vec{G}_{v_x}$ for each $v_x \in N_{v_i}(\vec{G}_{v_i})$, the count of (j − 1)-paths containing $v_i$ in $\vec{G}_{v_i}$ can be easily obtained. Specifically, we have the following recursive equation:

$$H(v_i, j) = \sum_{v_x \in N_{v_i}(\vec{G}_{v_i})} H(v_x, j-1). \tag{13}$$

Initially, we have

$$\begin{aligned} H(v_i, 0) &= 1, \text{ for all } i \in [1, n], \\ H(v_i, j) &= 0, \text{ for all } i \in [1, n],\ j \in [1, k-1]. \end{aligned} \tag{14}$$

Based on Eqs. (12), (13) and (14), we can easily devise a DP algorithm to compute $\mathrm{cnt}_{k-1}(\vec{G}, \mathrm{path})$ which is detailed in the DPPathCount procedure of Algorithm 5 (lines 5-12). It is easy to derive that the time complexity of DPPathCount is O(knχ), where χ is the maximum color value of G. This is because the cardinality of the out-neighbors for any node in $\vec{G}$ is bounded by O(χ).

Sampling a uniform k-color path. Similar to the DP-based sampling technique developed in Section 4.1, here we also propose a DP-based sampling algorithm to uniformly sample the k-color paths. Suppose without loss of generality that there is a randomly sampled k-color path of $\vec{G}$ starting from a node v, denoted by $P_v$. Then, the second node in $P_v$ must be an out-neighbor of v in $\vec{G}$. According to the DP equation (Eq. (13)), the number of (k − 1)-paths starting from v is equal to the sum of the number of (k − 2)-paths starting from each node in $N_v(\vec{G})$. Therefore, the next node of a random k-color path starting from v, denoted by u, should be drawn from $N_v(\vec{G})$ with probability $\frac{H(u, k-2)}{H(v, k-1)}$ by Eq. (13). We can recursively perform this sampling procedure to
obtain a k-color path. The detailed implementation of this sampling technique is shown in Algorithm 5.

Algorithm 5 first constructs a DAG $\vec{G}$ by the color ordering (line 1). Then, the algorithm invokes DPPathCount to derive the DP table H (line 2). After that, Algorithm 5 calls the DPPathSampling procedure to uniformly sample a k-color path (line 3). Specifically, when sampling a node u from $N_v(\vec{G})$, DPPathSampling needs to set a probability distribution D over the set $N_v(\vec{G})$ based on Eq. (13) (lines 16-18). After choosing a node u, DPPathSampling turns to sample the next node from $N_u(\vec{G})$ (line 19). The DPPathSampling procedure terminates when k nodes are sampled.

It is important to note that Algorithm 5 can always obtain a k-color path if the DAG $\vec{G}$ contains at least one k-color path. This is because in lines 16-18, if a node u is sampled, then H(u, k − i − 1) must be larger than 0, indicating that the out-neighborhood $N_u(\vec{G})$ must be non-empty. As a consequence, if there is a k-color path in $\vec{G}$, the for loop in line 15 of Algorithm 5 will be executed k times, which results in a k-color path. The following theorem shows that Algorithm 5 can obtain a uniform k-color path.

Theorem 9. Algorithm 5 outputs a uniform k-color path.

Proof. Consider a path $\{v_1, v_2, \cdots, v_k\}$. Let X be the event of this path being sampled by Algorithm 5. Denote by $Y_i$ the event of a node $v_i$ appearing in the path. Clearly, the probability of the first node $v_1$ being sampled is $\Pr(Y_1) = \frac{H(v_1, k-1)}{\sum_{u \in V} H(u, k-1)}$. Observe that in the ith iteration of the for loop (line 15), the distribution D for node $v_i$ is constructed from $N_{v_{i-1}}(\vec{G})$. The node $v_i$ being sampled in the for loop can be represented as an event $Y_i | Y_{i-1}$ (conditioned on $Y_{i-1}$), thus we have $\Pr(Y_i | Y_{i-1}) = \frac{H(v_i, k-i)}{\sum_{u \in N_{v_{i-1}}(\vec{G})} H(u, k-i)}$. As a consequence, we have

$$\begin{aligned} \Pr(X) &= \Pr(Y_1) \times \Pr(Y_2 | Y_1) \times \cdots \times \Pr(Y_k | Y_{k-1}) \\ &= \frac{H(v_1, k-1)}{\sum_{u \in V} H(u, k-1)} \times \frac{H(v_2, k-2)}{\sum_{u \in N_{v_1}(\vec{G})} H(u, k-2)} \times \cdots \times \frac{H(v_k, 0)}{\sum_{u \in N_{v_{k-1}}(\vec{G})} H(u, 0)} \\ &= \frac{1}{\sum_{u \in V} H(u, k-1)}. \end{aligned} \tag{15}$$

Since the number of k-color paths in G is equal to $\sum_{u \in V} H(u, k-1)$, each k-color path is sampled uniformly.

We analyze the time and space complexity of Algorithm 5 in the following theorem.

Theorem 10. Given an input graph G with n nodes and m edges, Algorithm 5 takes O(χnk + m) time and uses O(kn + m) space, where χ is the maximum color value.

Proof. First, the algorithm consumes O(m + n) time to obtain a DAG. Second, as analyzed above, the DPPathCount procedure takes O(nkχ) time. Third, the DPPathSampling procedure uses O(n + χk) time. This is because setting the probability distribution for the first node takes O(n) time, while for the other nodes it takes at most O(χ) time. Thus, the total time complexity of Algorithm 5 is O(χnk + m). For the space complexity, the algorithm needs to store the DAG and the DP table H, which uses O(nk + m) space in total.

5.2 Estimating the k-clique counts

Based on Algorithm 5, we can devise a weighted sampling algorithm to construct an unbiased estimator to compute the number of k-cliques. Specifically, we can slightly modify Algorithm 4 by (1) replacing the DPCount procedure in line 4 of Algorithm 4 with the DPPathCount procedure, and (2) replacing DPSampling in line 10 of Algorithm 4 with DPPathSampling. Due to the space limit, we omit the details of this modified algorithm. Similar to Theorems 4 and 5, the estimator based on the k-color path sampling is also unbiased, and the sample size can also be bounded by using the Chernoff bound. Moreover, it is easy to check that the sample size is no larger than that of Algorithm 4, because $\rho_p \ge \rho_c$.

For the time complexity, such a modified algorithm takes $O(|S|\delta^2 k)$ time to compute the DP tables (i.e., H) for all nodes in S (because the input graph $G(N_v(\vec{G}))$ for the DPPathCount procedure has at most δ nodes), and consumes $O(\delta k + k^2)$ time to sample a k-color path. Thus, the total time complexity of the algorithm is $O(|S|\delta^2 k + (\delta k + k^2)t + m + n)$, where O(m + n) is taken for computing the graph coloring. The space overhead of the modified algorithm is O(m + n + δk), because the DP table takes O(δk) space.

6 COLORFUL TRIANGLE-PATH SAMPLING

Note that k-color paths can significantly remove the unpromising k-color sets by introducing a connectivity constraint (i.e., a k-color set must form a path). However, the k-color path is still a very sparse structure, which does not fully capture the clique property. Specifically, a k-color path only guarantees the existence of k − 1 edges, which is the smallest number of edges to maintain the connectivity. In this section, we further develop a new technique, called k-triangle path, to prune those unpromising k-color paths that are not k-cliques. In a simple path, any two consecutive nodes form a 2-clique. Similarly, we define the concept of triangle-path. In a triangle-path, any three consecutive vertices form a triangle. When the nodes of a triangle-path have distinct colors, the triangle-path is called a colorful triangle-path. In the following, we use k-triangle path to refer to a colorful triangle-path with k nodes.

The k-triangle path can capture the clique property better than the k-color path, which can further improve the clique density. However, compared to the k-color path, the k-triangle path is a more complex structure. It is nontrivial to design efficient algorithms for uniformly sampling k-triangle paths. Below, we propose a new DP algorithm to achieve this goal.

6.1 DP-based k-triangle path sampling

As described in Section 5.1, we assume that the graph G is colored with the color values selected from [1, χ]. We can construct a DAG $\vec{G}$ based on the color ordering. Below, we formally define the concept of k-triangle path.

Definition 2. A k-triangle path is a k-color set with vertices $\{v_1, v_2, v_3, ..., v_k\}$ where $c(v_i) < c(v_{i+1})$ for all i ∈ [1, k − 1] and $v_i, v_{i+1}, v_{i+2}$ form a triangle for all i ∈ [1, k − 2].
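Definition 2 can be turned into a small membership test. The sketch below is illustrative only: `is_k_triangle_path`, the dictionary-of-sets adjacency and the toy graph are hypothetical, and the extra edge check on consecutive pairs is an added assumption that only matters for k = 2 (for k ≥ 3 it is implied by the triangle condition).

```python
def is_k_triangle_path(path, adj, color):
    """Check Definition 2 for a node sequence.

    path  : list of nodes v_1, ..., v_k
    adj   : dict mapping each node to the set of its neighbors in G
    color : dict mapping each node to its color value
    """
    k = len(path)
    # Colors must be strictly increasing along the sequence.
    for i in range(k - 1):
        if color[path[i]] >= color[path[i + 1]]:
            return False
    # Assumption for k = 2: consecutive nodes must at least be adjacent.
    for i in range(k - 1):
        if path[i + 1] not in adj[path[i]]:
            return False
    # Every three consecutive nodes must form a triangle.
    for i in range(k - 2):
        a, b, c = path[i], path[i + 1], path[i + 2]
        if not (b in adj[a] and c in adj[b] and c in adj[a]):
            return False
    return True

# Hypothetical toy graph: two triangles {0, 1, 2} and {1, 2, 3} sharing the edge (1, 2).
adj = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
color = {0: 1, 1: 2, 2: 3, 3: 4}
print(is_k_triangle_path([0, 1, 2, 3], adj, color))  # True
print(is_k_triangle_path([0, 2, 3], adj, color))     # False: 0 and 3 are not adjacent
```

In the toy graph, {0, 1, 2, 3} passes the test although it is not a 4-clique (the edge between 0 and 3 is missing), so sampled k-triangle paths still need a final clique check.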
Let $\vec{G}$ be the DAG generated by the color ordering. Counting the colorful triangle-paths in G is equivalent to counting the triangle-paths in $\vec{G}$. Then, any k-clique $C = \{v_1, v_2, \cdots, v_k\}$ in G is a k-triangle path in $\vec{G}$. As described in Section 5.1, the set of k-color paths is a subset of the set of k-color sets. Similarly, the set of k-triangle paths is a subset of the set of k-color paths. Thus, the k-clique density over the k-triangle paths, denoted by $\rho_t$, must be no less than the k-clique density over the k-color paths.

Example 3. In Fig. 1(d), the two 4-color paths in the box are k-triangle paths. For example, in the 4-color path {0, 1, 5, 6}, both {0, 1, 5} and {1, 5, 6} are triangles. However, in the 4-color path {0, 2, 3, 6}, the three consecutive nodes {2, 3, 6} do not form a triangle, thus {0, 2, 3, 6} is not a k-triangle path.

Counting the number of k-triangle paths. Denote by $T((v_x, v_y), j)$ the number of j-triangle-paths with $(v_x, v_y)$ as the first edge. Then, the total number of k-triangle paths of $\vec{G}$, denoted by $\mathrm{cnt}_k(G, \mathrm{triangle})$, can be computed by the following formula:

$$\mathrm{cnt}_k(G, \mathrm{triangle}) = \sum_{(v_i, v_j) \in \vec{G}} T((v_i, v_j), k). \tag{16}$$

Let $\vec{G}_{v_i}$ be a subgraph of $\vec{G}$ induced by $\{v_i, \cdots, v_n\}$. Since the node $v_x$ in $\vec{G}_{v_x}$ only has out-neighbors, the other nodes in the k-triangle paths are in $\vec{G}_{v_x}$. Denote by $v_z$ the third node in a j-triangle-path. It is easy to see that $v_z$ is a common neighbor of $v_x$ and $v_y$, because three consecutive nodes in a k-triangle path must form a triangle. Based on this property, we can derive the following equation:

$$T((v_x, v_y), j) = \sum_{v_z \in N_{v_x}(\vec{G}_{v_x}) \cap N_{v_y}(\vec{G}_{v_y})} T((v_y, v_z), j-1). \tag{17}$$

Initially, we have

$$T((v_x, v_y), 2) = 1, \quad \forall (v_x, v_y) \in E. \tag{18}$$

Based on these equations, we can easily devise a DP algorithm to compute $\mathrm{cnt}_k(G, \mathrm{triangle})$. The detailed implementation of this DP algorithm is shown in the TriPathCount procedure of Algorithm 6 (lines 5-11). Specifically, line 6 initializes the DP table based on Eq. (18), and lines 7-10 are the DP procedure based on Eq. (17).

Sampling a uniform k-triangle path. Similar to the algorithm to uniformly sample the k-color paths, we propose a DP-based sampling algorithm to uniformly draw k-triangle paths. Suppose that there is a randomly selected k-triangle path, denoted by P. With Eq. (16), we can derive that the probability of P starting with edge $(v_x, v_y)$ is $\frac{T((v_x, v_y), k)}{\mathrm{cnt}_k(G, \mathrm{triangle})}$. Then the third node $v_z$ in P must be a common out-neighbor of $v_x$ and $v_y$. According to Eq. (17), the number of k-triangle paths with $(v_x, v_y)$ as the first edge is equal to the sum of the numbers of (k − 1)-triangle paths with $(v_y, v_z)$ as the first edge, thus the probability of $v_z$ being sampled is $\frac{T((v_y, v_z), k-1)}{T((v_x, v_y), k)}$. A similar mechanism can be applied to sample the next nodes. The detailed implementation is shown in Algorithm 6.

Algorithm 6: DPTriSampler(G, k)
Input: A colored graph G = (V, E), and an integer k
Output: A uniformly sampled k-triangle path
1 $\vec{G}$ ← the DAG generated by the color ordering of G;
2 T ← TriPathCount($\vec{G}$, k);
3 R ← DPTriSampling($\vec{G}$, T, k);
4 return R;
5 Procedure TriPathCount($\vec{G}$, k)
6     $T((v_x, v_y), 2)$ ← 1, for $(v_x, v_y)$ ∈ E;
7     foreach j = 3 to k do
8         for $(v_x, v_y)$ ∈ E do
9             for $v_z \in N_{v_x}(\vec{G}_{v_x}) \cap N_{v_y}(\vec{G}_{v_y})$ do
10                $T((v_x, v_y), j)$ ← $T((v_x, v_y), j) + T((v_y, v_z), j-1)$;
11    return T;
12 Procedure DPTriSampling($\vec{G}$, T, k)
13    Set the probability distribution D over the edges in E where $p(e) = T(e, k)/\sum_{e \in E} T(e, k)$;
14    Sample an edge $(v_x, v_y)$ from D;
15    R ← $\{v_x, v_y\}$; Q ← $N_{v_x}(\vec{G}_{v_x}) \cap N_{v_y}(\vec{G}_{v_y})$;
16    for i = 3 to k do
17        cnt ← $\sum_{v_z \in Q} T((v_y, v_z), k-i+2)$;
18        Set the probability distribution D over the nodes in Q where $p(v_z) = T((v_y, v_z), k-i+2)/\mathrm{cnt}$ for each $v_z \in Q$;
19        Sample a node $v_z$ from D;
20        R ← R ∪ $\{v_z\}$;
21        Q ← $N_{v_y}(\vec{G}_{v_y}) \cap N_{v_z}(\vec{G}_{v_z})$;
22        $v_y$ ← $v_z$;
23    return R;

Algorithm 6 first constructs a DAG $\vec{G}$ by the color ordering (line 1). Then, the algorithm invokes TriPathCount to derive the DP table T and calls the DPTriSampling procedure to uniformly sample a k-triangle path (line 3). Based on Eq. (17), DPTriSampling sets a probability distribution D over the set of edges (line 13) and samples the first two nodes $v_x$ and $v_y$ according to D (lines 14-15). With the first two nodes, the set of candidate third nodes is the set of common out-neighbors of $v_x$ and $v_y$ (line 15). Then, DPTriSampling samples the next node from Q by setting a probability distribution over Q (lines 17-20). The DPTriSampling procedure terminates when k nodes are sampled.

A k-triangle path can always be obtained by Algorithm 6 if the DAG $\vec{G}$ contains at least one k-triangle path. This is because in lines 17-20, if a node $v_z$ is sampled, then $T((v_y, v_z), k-i+2)$ must be larger than 0, indicating that the set of common out-neighbors of $v_y$ and $v_z$ must be non-empty (line 21). As a consequence, the for loop in line 16 of Algorithm 6 will be executed k − 2 times, which results in a k-triangle path. The following theorem shows that Algorithm 6 can obtain a uniform k-triangle path.

Theorem 11. Algorithm 6 outputs a uniform k-triangle path.

Proof. Consider a path $\{v_1, v_2, \cdots, v_k\}$. Let X be the event of this path being sampled by Algorithm 6. Denote by $Y_i$ the event of two nodes $v_{i-1}$ and $v_i$ appearing in the path.
Clearly, the probability of the first two nodes $v_1$ and $v_2$ being sampled is $\Pr(Y_2) = \frac{T((v_1,v_2),k)}{\sum_{e\in E} T(e,k)}$. Observe that in the $i$-th iteration of the for loop (line 16), the nodes in $Q$ are the common out-neighbors of $v_{i-1}$ and $v_{i-2}$. Thus, the event that a node $v_i$ is sampled in the $i$-th iteration can be represented as $Y_i \mid Y_{i-1}$ (conditioned on $Y_{i-1}$). We have
\[
\Pr(Y_i \mid Y_{i-1}) = \frac{T((v_{i-1},v_i),k-i+2)}{\sum_{u \in N_{v_{i-1}}(\vec{G}_{v_{i-1}}) \cap N_{v_{i-2}}(\vec{G}_{v_{i-2}})} T((v_{i-1},u),k-i+2)} = \frac{T((v_{i-1},v_i),k-i+2)}{T((v_{i-2},v_{i-1}),k-i+3)}.
\]
As a result, we have
\[
\begin{aligned}
\Pr(X) &= \Pr(Y_2) \times \Pr(Y_3 \mid Y_2) \times \cdots \times \Pr(Y_k \mid Y_{k-1}) \\
&= \frac{T((v_1,v_2),k)}{\sum_{e\in E} T(e,k)} \times \frac{T((v_2,v_3),k-1)}{T((v_1,v_2),k)} \times \cdots \times \frac{T((v_{k-1},v_k),2)}{T((v_{k-2},v_{k-1}),3)} \\
&= \frac{1}{\sum_{e\in E} T(e,k)}.
\end{aligned}
\tag{19}
\]

Algorithm 6. Since the estimator is very similar to those shown in Section 4.2 and Section 5.2, we omit the details for brevity.

For the time complexity, such a modified algorithm takes $O(\triangle_S k)$ time to compute the DP tables (i.e., the DP table $T$ in Algorithm 6) for all nodes in $S$, where $\triangle_S$ is the total number of triangles in the dense region $S$. It also takes $O(\chi k + k^2)$ time to sample one $k$-triangle path, where $O(k^2)$ is the time to check whether the sampled $k$ nodes form a $k$-clique. Thus, the total time complexity of the algorithm is $O(\triangle_S k + (\chi k + k^2)t + m + n)$, where $O(m + n)$ is the time for computing the graph coloring. The space overhead of the modified algorithm is $O(\delta^2 k)$, because the DP table takes $O(m' k)$ space, where $m'$ is the maximum number of edges of $\vec{G}_v$, and it must satisfy $m' \le \delta^2$.
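The telescoping argument behind Equation (19) can be checked on a toy graph. The sketch below is not the paper's Algorithm 6: it ignores the coloring and the per-node subgraphs, orients edges by node id, and assumes the recurrence implied by the derivation above, namely $T(e,2)=1$ and $T((a,b),j)=\sum_{u} T((b,u),j-1)$ over the common out-neighbors $u$ of $a$ and $b$. The helper names `make_dag` and `path_probabilities` are hypothetical. It computes, with exact fractions, the probability of drawing each $k$-node triangle path and lets one confirm that all paths are equally likely.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import combinations


def make_dag(edges):
    """Orient every undirected edge from the smaller to the larger node id."""
    out = {}
    for a, b in edges:
        u, v = min(a, b), max(a, b)
        out.setdefault(u, set()).add(v)
        out.setdefault(v, set())
    return out


def path_probabilities(out, k):
    """Return {triangle path: exact probability of sampling it} for k >= 2."""

    @lru_cache(maxsize=None)
    def T(a, b, j):
        # Number of j-node triangle paths whose first two nodes are (a, b).
        if j == 2:
            return 1
        return sum(T(b, u, j - 1) for u in out[a] & out[b])

    edges = [(a, b) for a in out for b in out[a]]
    total = sum(T(a, b, k) for a, b in edges)
    probs = {}

    def extend(path, prob):
        if len(path) == k:
            probs[tuple(path)] = prob
            return
        a, b = path[-2], path[-1]
        remaining = k - len(path) + 2  # length of the suffix starting at (a, b)
        for u in out[a] & out[b]:
            num = T(b, u, remaining - 1)
            if num:
                # Conditional probability of picking u, as in the derivation.
                extend(path + [u], prob * Fraction(num, T(a, b, remaining)))

    for a, b in edges:
        if T(a, b, k):
            # The first edge is drawn proportionally to T(e, k).
            extend([a, b], Fraction(T(a, b, k), total))
    return probs


if __name__ == "__main__":
    k5 = make_dag(combinations(range(5), 2))  # complete graph on 5 nodes
    for path, p in sorted(path_probabilities(k5, 4).items()):
        print(path, p)
```

On the complete graph with 5 nodes and $k=4$, every triangle path is an increasing 4-node sequence, so there are 5 of them and each is drawn with probability 1/5.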
[Figure 2: two panels, (a) Stanford and (b) Orkut; x-axis: k; y-axis: density; series: DPColor, DPColorPath, DPTriPath.]
Fig. 2. Comparing $\rho_c$, $\rho_p$, and $\rho_t$ for different $k$.

TABLE 1
Networks       n            m               δ
Themaker       69,413       3,289,686       164
Stanford       281,903      1,992,636       71
DBLP           425,957      1,049,866       113
Google         916,428      4,322,051       44
Skitter        1,696,415    11,095,298      111
Orkut          3,072,627    117,185,083     253
LiveJournal    4,036,538    34,681,189      360
Friendster     65,608,366   1,806,067,135   304

[Figure 4: two panels, (a) Stanford and (b) LiveJournal; x-axis: number of samples; y-axis: relative error (%); series: PEANUTS, DPColor, DPColorPath, DPTriPath.]
Fig. 4. Relative errors with varying sample size ($k = 8$).

Algorithm 7: The Adaptive Sampling Framework
Input: A graph $G = (V, E)$, the dense part $S$ of $G$, an integer $k$, and the error bound $\epsilon$
Output: A $(1 - \epsilon)$-approximation of the density of $k$-cliques.
1  $C \leftarrow 0$; $T \leftarrow 0$;
2  $t \leftarrow 10000$;
3  $threshold \leftarrow \frac{3}{\epsilon^2} \ln \frac{1}{\sigma}$;
4  while $C < threshold$ do
5      Sampling($\vec{G}$, $S$, $k$, $t$);
6      $T \leftarrow T + t$;
7      $c \leftarrow$ count of sampled cliques among the $t$ samples;
8      $C \leftarrow C + c$;
9      if $C > 0$ then $t \leftarrow threshold / C \times T$;
10     else $t \leftarrow t \times 10$;
11 return $C / T$;

7 ADAPTIVELY DETERMINING THE SAMPLE SIZE

Algorithm 2 requires the sample size $t$ to be set to a fixed value. The advantage of a fixed sample size is that the running time can be controlled by the parameter $t$. However, there is no guarantee that the results given by Algorithm 2 are accurate. To overcome this problem, we provide a new framework that can guarantee the accuracy.

The key idea of the new framework is that an estimate is accurate if the number of cliques in the samples exceeds a threshold. We set the threshold as $\frac{3}{\epsilon^2} \ln \frac{1}{\sigma}$ according to Theorem 13, which states the idea precisely.

Theorem 13. Suppose that the sample size is $t$ and the number of $k$-cliques among the $t$ samples is $c$. Then $\hat{\rho} = \frac{c}{t}$ is a $(1 - \epsilon)$-approximation of $\rho$ with probability $1 - 2\sigma$ if $c \ge \frac{3}{\epsilon^2} \ln \frac{1}{\sigma}$.

Proof. Since $c = \hat{\rho} t$, we have $t \ge \frac{3}{\hat{\rho} \epsilon^2} \ln \frac{1}{\sigma}$. Then the theorem can be proved by Theorem 5.

Theorem 13 states that $\hat{\rho}$ is accurate as long as $c$ is large enough, regardless of the value of $t$. Based on this idea, we design a new framework that keeps sampling until $c$ is larger than the threshold. In the new framework, we use adaptive sampling to adapt the sample size according to the existing sampling results. If there are already $C$ cliques in $T$ samples and we need $threshold$ cliques in total, the next sample size should be $threshold / C \times T$.

The details of the new framework are shown in Algorithm 7. Algorithm 7 takes an error bound $\epsilon$ as input and returns a $(1 - \epsilon)$-approximation. At first, it draws $10^4$ samples to test the clique density (line 2). If no clique is sampled, it uses more samples to test the clique density (line 10). If there are cliques in the $T$ samples, it adjusts the number of samples accordingly (line 9). The adjustment in line 9 is a simple yet effective way to make the value of $C$ approach the threshold. At last, the approximation $C / T$ is returned (line 11).

Instead of the time complexity, we analyze the upper bound of the number of samples drawn by Algorithm 7, i.e., the value of $T$ in Algorithm 7, which is the key to the running time. We omit the proof because it is straightforward.

Theorem 14. The number of samples drawn by Algorithm 7 is $O(\max(10^4, \frac{3}{\rho \epsilon^2} \ln \frac{1}{\sigma}))$.

The advantage of the new framework is that it can guarantee the accuracy of the results. The disadvantage is that the time complexity of our algorithm depends on the clique density. Therefore, when $k$ is large (e.g., $k > 25$), the clique density might be extremely small, so the algorithm requires a large number of samples to achieve a good accuracy guarantee. In this case, obtaining a good approximation may be costly. Fortunately, for real-world applications, $k$ is often not very large (e.g., $k < 20$), and our algorithm is very efficient in practice, as shown in our experiments. In fact, in the subgraph counting field, there are no existing algorithms that have both polynomial time complexity and a strong accuracy guarantee [35].

Example 5. To aid understanding, we describe how the adaptive sampling framework works on the Orkut network with $\epsilon = 0.05$, $\delta = 0.01$ and DPPathSampler. The threshold in line 3 is 5519. The real clique density is 0.0132. At first, the framework samples $10^4$ times and gets 91 cliques. Now the estimated density is 0.0091 and the error is $\frac{0.0132 - 0.0091}{0.0132} = 0.31$, which is larger than $\epsilon$. According to the adaptive sampling method, to make the count of the sampled cliques exceed $threshold$, we need $threshold / C \times T = 606483$ more samples (line 9). After sampling, there are 7837 cliques in the 606483 samples. Now there are $C = 7837 + 91$ cliques among the $T = 606483 + 10000$ samples, and the estimated clique density is 0.0129. The error is $\frac{0.0132 - 0.0129}{0.0132} = 0.02$, which is smaller than $\epsilon$.
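The control flow of Algorithm 7 can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: `sample_batch` is a hypothetical callback standing in for the Sampling procedure, `bernoulli_sampler` is a toy sampler that treats every sample as an independent clique with probability `rho`, the batch size is rounded down (which reproduces the 606483 of Example 5), and the `max_total` cap is an added safeguard.

```python
import math
import random


def clique_threshold(eps: float, sigma: float) -> float:
    """Threshold (3 / eps^2) * ln(1 / sigma) on the number of sampled cliques."""
    return 3.0 / (eps * eps) * math.log(1.0 / sigma)


def next_batch_size(threshold: float, C: int, T: int) -> int:
    """Line 9 of Algorithm 7: scale the sample budget by threshold / C.

    Rounding down is an assumption chosen to match the numbers in Example 5.
    """
    return int(threshold / C * T)


def adaptive_estimate(sample_batch, eps, sigma, t0=10_000, max_total=10**10):
    """Keep sampling until the number of sampled cliques reaches the threshold.

    sample_batch(t) is a hypothetical callback that draws t samples and
    returns how many of them are k-cliques. Returns (estimate, T), or
    (None, T) if the next batch would push the total past max_total.
    """
    threshold = clique_threshold(eps, sigma)
    C, T, t = 0, 0, t0
    while C < threshold:
        if T + t > max_total:
            return None, T
        c = sample_batch(t)
        T += t
        C += c
        # Grow the next batch: proportionally if cliques were seen, else 10x.
        t = next_batch_size(threshold, C, T) if C > 0 else t * 10
    return C / T, T


def bernoulli_sampler(rho, seed=0):
    """Toy stand-in for a real sampler: each sample is a clique w.p. rho."""
    rng = random.Random(seed)
    return lambda t: sum(rng.random() < rho for _ in range(t))


if __name__ == "__main__":
    est, total = adaptive_estimate(bernoulli_sampler(0.02), eps=0.1, sigma=0.01)
    print(f"estimate={est:.4f} after {total} samples")
    # Arithmetic of Example 5, taking the stated threshold 5519 as given.
    print(next_batch_size(5519, 91, 10_000))
```

Passing the sampler as a callback keeps the stopping rule independent of which sampling technique produces the batches, so the same loop could wrap any of the three samplers.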
[Figure 3: eight panels, (a) Themaker, (b) webStanford, (c) DBLP, (d) webGoogle, (e) Skitter, (f) Orkut, (g) LiveJournal, (h) Friendster; x-axis: k; y-axis: Time (s).]
Fig. 3. Running time of different algorithms (the relative errors for PEANUTS, DPColor, DPColorPath and DPTriPath are set to 0.1%).
[Figure 5: two panels, (a) the relative error and (b) the running time needed to bring the error below 10%; x-axis: k; series: PEANUTS, DPColor, DPColorPath, DPTriPath.]
Fig. 5. The performance on Orkut when $k$ is large.

The k-clique densities ($\rho_c / \rho_p / \rho_t$) in the dense regions (%)
Networks       k = 8                 k = 15
Themaker       0.001/0.003/0.02      0.0/0.0/1e-7
Stanford       70.8/77.7/97.3        45.8/53.1/91.7
DBLP           100.0/100.0/100.0     100.0/100.0/100.0
Google         91.2/94.4/96.0        84.7/85.7/85.9
Skitter        5.3/26.4/47.8         0.1/2.3/5.9
Orkut          0.0/2.6/22.8          0.0/0.0002/1.4
LiveJournal    80.4/91.0/95.0        -/-/-
Friendster     0.0/18.1/62.9         0.0/52.2/78.0

TABLE 3
Runtime of our parallel algorithms ($k = 8$, $t = 5 \times 10^6$, sec.)
                              Threads
Datasets      Algorithms      1         4        8        12       16
LiveJournal   DPColor         24.8      7.1      4.3      2.7      2.1
              DPColorPath     28.4      7.5      3.9      2.7      2.1
              DPTriPath       142.67    37.98    18.94    12.80    9.81
Friendster    DPColor         2481.5    650.3    341.6    244.4    196.5
              DPColorPath     2132.2    559.6    293.3    210.3    171.9
              DPTriPath       2430.32   636.94   336.31   239.68   197.01

8 EXPERIMENTS

8.1 Experimental setup

We compare the proposed algorithms with three state-of-the-art $k$-clique counting algorithms: kClist [15], [16], PIVOTER [17], and TuranShadow [20]. The kClist algorithm is an exact $k$-clique counting algorithm based on $k$-clique enumeration [15]. Note that the original kClist algorithm is based on the degeneracy ordering. Li et al. [16] proposed an improved version based on a hybrid of the degeneracy and color orderings. In our experiments, kClist denotes this improved version. PIVOTER and TuranShadow are the state-of-the-art exact and approximate $k$-clique counting algorithms, respectively. Both PIVOTER and TuranShadow were proposed by Jain and Seshadhri [17], [20]. PEANUTS [27] is an improved version of TuranShadow that is more efficient, so we use PEANUTS as the baseline instead of TuranShadow. The C++ codes of all these algorithms are publicly available, so we use their implementations in our experiments. For our algorithms, we implement DPColor, DPColorPath and DPTriPath. The three algorithms are Algorithm 2 integrated with the three sampling algorithms. All of them are implemented in C++. All algorithms are evaluated on a PC with two 2.1 GHz Xeon CPUs (16 cores in total) and 128GB memory running CentOS 7.6.

Datasets. We use 8 large real-life datasets in our experiments. Table 1 summarizes the detailed statistics of all datasets. The last column of Table 1 denotes the degeneracy of the graph. Stanford and Google are web networks. DBLP is a co-authorship network, and Skitter is an internet graph. Themaker, Orkut, LiveJournal, and Friendster are social networks. All datasets are downloaded from (snap.stanford.edu) and (https://networkrepository.com/networks.php).

8.2 Experimental results

Exp 1: Runtime of different algorithms. In this experiment, we compare the running time of different algorithms on all datasets. Note that for each approximation algorithm (PEANUTS, DPColor, DPColorPath and DPTriPath), we record its running time when the algorithm achieves a 0.1% relative error. Here the relative error is computed by $|f - \hat{f}| / f$, in which $f$ is the exact $k$-clique count and $\hat{f}$ is the estimated count. For all algorithms, if they cannot terminate within 5 hours, we set their running time to "INF". Fig. 3 shows the running time of various algorithms.
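The stopping criterion used in Exp 1 is the relative error $|f - \hat{f}| / f$. A minimal sketch of that check follows; the function names and the default target of 0.1% are illustrative choices, not part of the released code.

```python
def relative_error(exact: float, estimate: float) -> float:
    """Relative error |f - f_hat| / f of an estimated k-clique count."""
    if exact <= 0:
        raise ValueError("the exact count must be positive")
    return abs(exact - estimate) / exact


def meets_target(exact: float, estimate: float, target: float = 0.001) -> bool:
    """True once the estimate is within the target relative error (0.1% here)."""
    return relative_error(exact, estimate) <= target


if __name__ == "__main__":
    # Hypothetical counts, used only to exercise the two helpers.
    print(relative_error(1000, 990))
    print(meets_target(1000, 999))
```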
TABLE 5
The maximum clique size / $\max \delta_v$ / $\lceil \sqrt{k\alpha + \frac{1}{2}} \rceil$ of the sparse regions of different graphs
DBLP           113    4/7/19      13/16/31    20/23/37    27/30/42
Skitter        111    8/11/19     20/23/30    27/30/37    35/38/41
Orkut          253    10/13/28    22/26/45    30/34/56    35/38/62
LiveJournal    360    8/11/33     19/22/54    24/28/66    28/31/74
Friendster     304    21/24/31    47/50/50    50/63/61    60/63/68

TABLE 7
The performance of the adaptive sampling technique in Algorithm 7 for various densities ($\delta = 0.01$)
Density    Setting                              ε       T             Error
0.86       DPColor, k = 12, Google              0.01    320358        0.000
                                                0.05    7254          0.001
                                                0.1     2563          0.003
0.52       DPColorPath, k = 15, Friendster      0.01    273610        0.000
                                                0.05    20550         0.000
                                                0.1     10000         0.003
0.06       DPTriPath, k = 15, Skitter           0.01    2464188       0.001
                                                0.05    103032        0.002
                                                0.1     33517         0.003
0.0067     DPColor, k = 12, Skitter             0.01    25089143      0.004
                                                0.05    1672969       0.002
                                                0.1     314110        0.009
0.0002     DPColorPath, k = 15, Orkut           0.01    1120373294    0.002
                                                0.05    38053125      0.009
                                                0.1     10222666      0.021

[Figure 6: bar chart over Themaker and com-LiveJournal; series: PIVOTER, PEANUTS, DPColor, DPColorPath, DPTriPath.]
Fig. 6. Memory usage of various algorithms ($k = 8$).
TABLE 6
The effect of different thresholds for splitting the network ($k = 8$, $t = 5 \times 10^6$)

(3) the value of the error bound $\epsilon$ in Algorithm 7, (4) the total sample size, i.e., the value of $T$ in Algorithm 7, and (5) the estimate error. As shown in Table 7, regardless of the density, the estimate error is consistently smaller than the given error bound $\epsilon$. The value of $T$ tends to become larger when the clique density and the error bound become smaller. These results are consistent with Theorem 13, which confirms that Algorithm 7 can achieve a good accuracy guarantee.

Table 8 shows the running time of DPColor, DPColorPath and DPTriPath equipped with Algorithm 7 when $k = 24$. "INF" means that the adaptive sample size exceeds $10^{10}$. In Table 8, DPColor and DPColorPath are faster than DPTriPath on Stanford and LiveJournal, and slower on Skitter and Orkut. This is because the clique density differs on these datasets. In Table 8, $\rho_t$ is much larger than $\rho_c$ and $\rho_p$ on Skitter and Orkut, and they are similar on Stanford and LiveJournal. For example, we have $\rho_c = 0.00002$, $\rho_p = 0.001$, $\rho_t = 0.005$ on Skitter and $\rho_c = 0.37$, $\rho_p = 0.46$, $\rho_t = 0.88$ on Stanford. These results further confirm the analysis in Section 6.3.

9 FURTHER RELATED WORK

K-clique and triangle counting. Besides the practical algorithms introduced above, there also exist some theoretical studies on the $k$-clique counting problem [37], [38], [39], [40]. Most of these theoretical works focus mainly on devising an algorithm that achieves a better worst-case time complexity. The practical performance of such algorithms is often much worse than that of the state-of-the-art practical algorithms [16]. A triangle is a specific $k$-clique for $k = 3$. The problem of counting triangles in a graph has a long history. There are many algorithms in the literature [31], [41], [42], [43], [44]. For example, both [41] and [42] are ordering-based exact triangle counting algorithms. Chu and Cheng [43] developed an I/O-efficient exact algorithm for triangle listing. Tsourakakis et al. [31] proposed an edge sampling algorithm to approximate the number of triangles in a graph. Becchetti et al. [44] presented an approximate triangle counting algorithm in the semi-streaming model. Tom et al. [45] and Hu et al. [46] developed efficient GPU-parallel algorithms for triangle counting in the shared-memory many-core platforms.

Motif counting. Many exact and sampling-based approximation algorithms have been proposed for motif counting [18], [23], [26], [35], [47], [48], and some of them can also be used to count $k$-cliques. Notable examples include the color coding based algorithms [23], [26], and the edge sampling based algorithms [18]. However, as shown in [20], none of these algorithms scales to large graphs, and their practical performance is worse than that of TuranShadow.

10 CONCLUSION

In this paper, we propose a time and space efficient framework for $k$-clique counting. Our framework first divides the graph into sparse and dense regions based on the average degree. Then, for the sparse regions, we use the state-of-the-art PIVOTER algorithm to compute the exact number of $k$-cliques. For the dense regions, we develop three novel DP-based $k$-color set, $k$-color path, and $k$-triangle path sampling techniques to estimate the $k$-clique count, respectively. Extensive experiments on 8 real-life graphs show that our algorithms are very efficient and accurate and also use less space than the state-of-the-art algorithms.

ACKNOWLEDGMENTS

This work was partially supported by (i) National Key Research and Development Program of China 2020AAA0108503, (ii) NSFC Grants U2241211, 62072034, and (iii) CCF-Huawei Populus Grove Fund.

REFERENCES

[1] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: Simple building blocks of complex networks,” Science, vol. 298, no. 5594, pp. 763–764, 2010.
[2] S. R. Burt, “Structural holes and good ideas,” American Journal of Sociology, vol. 110, no. 2, pp. 349–399, 2004.
[3] K. Faust, “A puzzle concerning triads in social networks: Graph constraints and the triad census,” Soc. Networks, vol. 32, no. 3, pp. 221–233, 2010.
[4] N. Przulj, D. G. Corneil, and I. Jurisica, “Modeling interactome: scale-free or geometric?” Bioinform., vol. 20, no. 18, pp. 3508–3515, 2004.
[5] C. Seshadhri and S. Tirthapura, “Scalable subgraph counting: The methods behind the madness,” in WWW, 2019.
[6] J. W. Berry, B. Hendrickson, R. A. Laviolette, and C. A. Phillips, “Tolerating the community detection resolution limit with edge weighting,” Physical Review E Statistical Nonlinear & Soft Matter Physics, vol. 83, no. 5, p. 056119, 2011.
[7] B. Sun, M. Danisch, T. H. Chan, and M. Sozio, “Kclist++: A simple algorithm for finding k-clique densest subgraphs in large graphs,” Proc. VLDB Endow., vol. 13, no. 10, pp. 1628–1640, 2020.
[8] C. E. Tsourakakis, “The k-clique densest subgraph problem,” in WWW, 2015.
[9] A. E. Sariyüce, C. Seshadhri, A. Pinar, and Ü. V. Çatalyürek, “Finding the hierarchy of dense subgraphs using nucleus decompositions,” in WWW, 2015.
[10] A. R. Benson, D. F. Gleich, and J. Leskovec, “Higher-order organization of complex networks,” Science, vol. 353, no. 6295, 2016.
[11] H. Yin, A. R. Benson, and J. Leskovec, “Higher-order clustering in networks,” Physical Review E, vol. 97, no. 5, p. 052306, 2017.
[12] N. Chiba and T. Nishizeki, “Arboricity and subgraph listing algorithms,” SIAM J. Comput., vol. 14, no. 1, pp. 210–223, 1985.
[13] I. Finocchi, M. Finocchi, and E. G. Fusco, “Clique counting in MapReduce: Algorithms and experiments,” ACM J. Exp. Algorithmics, vol. 20, pp. 1.7:1–1.7:20, 2015.
[14] K. Makino and T. Uno, “New algorithms for enumerating all maximal cliques,” in 9th Scandinavian Workshop on Algorithm Theory, 2004.
[15] M. Danisch, O. Balalau, and M. Sozio, “Listing k-cliques in sparse real-world graphs,” in WWW, 2018.
[16] R. Li, S. Gao, L. Qin, G. Wang, W. Yang, and J. X. Yu, “Ordering heuristics for k-clique listing,” Proc. VLDB Endow., vol. 13, no. 11, pp. 2536–2548, 2020.
[17] S. Jain and C. Seshadhri, “The power of pivoting for exact clique counting,” in WSDM, 2020.
[18] M. Rahman, M. A. Bhuiyan, and M. A. Hasan, “Graft: An efficient graphlet counting method for large graph analysis,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 10, pp. 2466–2478, 2014.
[19] N. Alon, R. Yuster, and U. Zwick, “Color-coding: a new method for finding simple paths, cycles and other small subgraphs within large graphs,” in STOC, 1994.
[20] S. Jain and C. Seshadhri, “A fast and provable method for estimating clique counts using Turán’s theorem,” in WWW, 2017.
[21] D. W. Matula and L. L. Beck, “Smallest-last ordering and clustering and graph coloring algorithms,” J. ACM, vol. 30, no. 3, pp. 417–427, 1983.
[22] E. Tomita, A. Tanaka, and H. Takahashi, “The worst-case time complexity for generating all maximal cliques and computational experiments,” Theor. Comput. Sci., vol. 363, no. 1, pp. 28–42, 2006.
[23] M. Bressan, S. Leucci, and A. Panconesi, “Motivo: Fast motif counting via succinct color coding and adaptive sampling,” Proc. VLDB Endow., vol. 12, no. 11, pp. 1651–1663, 2019.
[24] M. Jha, C. Seshadhri, and A. Pinar, “Path sampling: A fast and provable method for estimating 4-vertex subgraph counts,” in WWW, 2015.
[25] P. Wang, J. Zhao, X. Zhang, Z. Li, J. Cheng, J. C. S. Lui, D. Towsley, J. Tao, and X. Guan, “MOSS-5: A fast method of approximating counts of 5-node graphlets in large graphs,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 1, pp. 73–86, 2018.
[26] M. Bressan, F. Chierichetti, R. Kumar, S. Leucci, and A. Panconesi, “Motif counting beyond five nodes,” ACM Trans. Knowl. Discov. Data, vol. 12, no. 4, pp. 48:1–48:25, 2018.
[27] S. Jain and C. Seshadhri, “Provably and efficiently approximating near-cliques using the Turán shadow: PEANUTS,” in WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, pp. 1966–1976.
[28] B. Balasundaram and S. Butenko, “Graph domination, coloring and cliques in telecommunications,” in Handbook of Optimization in Telecommunications. Springer, 2006, pp. 865–890.
[29] L. Chang and L. Qin, “Cohesive subgraph computation over large sparse graphs,” in ICDE, 2019.
[30] V. Batagelj and M. Zaversnik, “An O(m) algorithm for cores decomposition of networks,” CoRR, vol. cs.DS/0310049, 2003.
[31] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos, “DOULION: counting triangles in massive graphs with a coin,” in KDD, 2009.
[32] W. Hasenplaugh, T. Kaler, T. B. Schardl, and C. E. Leiserson, “Ordering heuristics for parallel graph coloring,” in SPAA, 2014.
[33] L. Yuan, L. Qin, X. Lin, L. Chang, and W. Zhang, “Effective and efficient dynamic graph coloring,” Proc. VLDB Endow., vol. 11, no. 3, pp. 338–351, 2017.
[34] L. Li, “Discrete distributions,” 1972.
[35] P. Ribeiro, P. Paredes, M. E. P. Silva, D. Aparício, and F. M. A. Silva, “A survey on subgraph counting: Concepts, algorithms, and applications to network motifs and graphlets,” ACM Comput. Surv., vol. 54, no. 2, pp. 28:1–28:36, 2022.
[36] M. Almasri, I. E. Hajj, R. Nagi, J. Xiong, and W. Hwu, “Parallel k-clique counting on GPUs,” in ICS, 2022.
[37] T. Eden, D. Ron, and C. Seshadhri, “On approximating the number of k-cliques in sublinear time,” in STOC, 2018.
[38] T. Eden, D. Ron, and C. Seshadhri, “Faster sublinear approximation of the number of k-cliques in low-arboricity graphs,” in SODA, 2020.
[39] K. Censor-Hillel, Y. Chang, F. L. Gall, and D. Leitersdorf, “Tight distributed listing of cliques,” in SODA, 2021.
[40] L. Gianinazzi, M. Besta, Y. Schaffner, and T. Hoefler, “Parallel algorithms for finding large cliques in sparse graphs,” in SPAA, 2021.
[41] M. Latapy, “Main-memory triangle computations for very large (sparse (power-law)) graphs,” Theor. Comput. Sci., vol. 407, no. 1-3, pp. 458–473, 2008.
[42] M. Ortmann and U. Brandes, “Triangle listing algorithms: Back from the diversion,” in ALENEX, 2014.
[43] S. Chu and J. Cheng, “Triangle listing in massive networks and its applications,” in KDD, 2011.
[44] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis, “Efficient semi-streaming algorithms for local triangle counting in massive graphs,” in KDD, 2008.
[45] A. S. Tom, N. Sundaram, N. K. Ahmed, S. Smith, S. Eyerman, M. Kodiyath, I. Hur, F. Petrini, and G. Karypis, “Exploring optimizations on shared-memory platforms for parallel triangle counting algorithms,” in HPEC, 2017.
[46] L. Hu, L. Zou, and Y. Liu, “Accelerating triangle counting on GPU,” in SIGMOD, 2021.
[47] N. Pashanasangi and C. Seshadhri, “Efficiently counting vertex orbits of all 5-vertex subgraphs, by EVOKE,” in WSDM, 2020.
[48] A. Pinar, C. Seshadhri, and V. Vishal, “ESCAPE: efficiently counting all 5-vertex subgraphs,” in WWW, 2017.

Xiaowei Ye received the BE degree from Shandong University, China, in 2021, and is working toward the PhD degree at Beijing Institute of Technology (BIT), Beijing, China. His research interests include subgraph counting, graph data mining and social network analysis.

Rong-Hua Li received the PhD degree from the Chinese University of Hong Kong, in 2013. He is currently a professor with the Beijing Institute of Technology (BIT), Beijing, China. Before joining BIT in 2018, he was an assistant professor with Shenzhen University. His research interests include graph data management and mining, social network analysis, graph computation systems, and graph-based machine learning.

Qiangqiang Dai is working toward the PhD degree at Beijing Institute of Technology (BIT), Beijing, China. His research interests include graph data management and mining, social network analysis, and graph computation systems.

Hongzhi Chen received his Ph.D. degree from the Department of Computer Science and Engineering, the Chinese University of Hong Kong, in 2020. He is currently a senior R.D. at ByteDance Infrastructure Team, Beijing, China, working on graph related storage, processing and training systems. His research interests cover the broad area of distributed systems and databases, with special emphasis on graph systems and machine learning/deep learning systems.

Guoren Wang received the BS, MS, and PhD degrees from the Department of Computer Science, Northeastern University, China, in 1988, 1991, and 1996, respectively. Currently, he is a professor with the Beijing Institute of Technology (BIT), Beijing, China. His research interests include graph data management and mining, query processing and optimization, graph computation systems.