
Efficient K-Clique Counting On Large Graphs The Power of Color-Based Sampling Approaches

This article proposes new algorithms for efficiently estimating the number of k-cliques in large graphs. It develops three novel dynamic programming based k-color set sampling techniques that use a small number of samples to accurately estimate k-clique counts. Experimental results show the best algorithm is over an order of magnitude faster than state-of-the-art sampling methods and up to three orders of magnitude faster than exact algorithms on large graphs, while achieving high accuracy.


This article has been accepted for publication in IEEE Transactions on Knowledge and Data Engineering.

This is the author's version which has not been fully edited and
content may change prior to final publication. Citation information: DOI 10.1109/TKDE.2023.3314643

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1

Efficient k-clique Counting on Large Graphs: the Power of Color-based Sampling Approaches
Xiaowei Ye, Rong-Hua Li, Qiangqiang Dai, Hongzhi Chen, and Guoren Wang

Abstract—k-clique counting is a fundamental problem in network analysis which has attracted much attention in recent years. Computing the count of k-cliques in a graph for a large k (e.g., k = 8) is often intractable as the number of k-cliques increases exponentially with respect to (w.r.t.) k. Existing exact k-clique counting algorithms often struggle to handle large dense graphs, while sampling-based solutions either require a huge number of samples or consume very high storage space to achieve a satisfactory accuracy. To overcome these limitations, we propose a new framework to estimate the number of k-cliques which integrates both the exact k-clique counting technique and three novel color-based sampling techniques. The key insight of our framework is that we only apply the exact algorithm to compute the k-clique counts in the sparse regions of a graph, and use the proposed color-based sampling approaches to estimate the number of k-cliques in the dense regions of the graph. Specifically, we develop three novel dynamic programming based k-color set sampling techniques to efficiently estimate the k-clique counts, where a k-color set contains k nodes with k different colors. Since a k-color set is often a good approximation of a k-clique in the dense regions of a graph, our sampling-based solutions are extremely efficient and accurate. Moreover, the proposed sampling techniques are space efficient, using near-linear space w.r.t. the graph size. We conduct extensive experiments to evaluate our algorithms using 8 real-life graphs. The results show that our best algorithm is at least one order of magnitude faster than the state-of-the-art sampling-based solutions (with the same relative error 0.1%) and can be up to three orders of magnitude faster than the state-of-the-art exact algorithm on large graphs.

Index Terms—k-clique counting, Cohesive subgraphs, Graph coloring, Graph sampling, Dynamic programming

1 INTRODUCTION

Real-life networks, such as social networks, web graphs, and biological networks, often contain frequently-occurring small subgraph structures. Such frequent small subgraphs are referred to as network motifs [1]. Counting the motifs is a fundamental tool in many network analysis applications, including social network analysis, community detection, and bioinformatics [1], [2], [3], [4], [5]. Perhaps the most elementary motif in a graph is the k-clique which has been widely used in a variety of network analysis applications [1], [2], [6], [7], [8].

Given a graph G, a k-clique is a complete subgraph of G with k nodes. Counting the k-cliques in a graph has found many important applications in dense subgraph mining and social network analysis. For example, Sariyüce et al. [9] proposed a nucleus decomposition method to find the hierarchy of dense subgraphs, which uses the k-clique counting operator as a basic building block. Tsourakakis [8] studied a k-clique densest subgraph problem which also uses the k-clique counting operator as a building block. Additionally, the k-clique counting operator has also been applied to detect higher-order organizations in social networks [10], [11].

Motivated by the above applications, many practical k-clique counting algorithms have been proposed [12], [13], [14], [15], [16], [17], [18], [19], [20]. Existing k-clique counting algorithms can be classified into (1) exact k-clique counting methods, and (2) sampling-based approximation solutions. Chiba and Nishizeki [12] developed the first exact k-clique counting algorithm based on k-clique enumeration which is very efficient on real-life sparse graphs for a small k. Such an algorithm was recently improved by Finocchi et al. [13] based on a degree ordering technique. Subsequently, Danisch et al. [15] further improved this algorithm by using a degeneracy ordering technique [21]. More recently, Li et al. [16] developed a further improved algorithm based on a hybrid of degeneracy and color ordering technique. All these exact k-clique counting algorithms are based on k-clique enumeration, and are typically intractable on large graphs for a large k (e.g., k ≥ 8) due to combinatorial explosion. To overcome this issue, Jain and Seshadhri developed an elegant algorithm, called PIVOTER, based on a classic pivoting technique which was widely used for pruning the search branches in maximal clique enumeration [22]. The key idea of PIVOTER is that it can implicitly construct a succinct clique tree (SCT) by using the pivoting technique in the search procedure. Such an SCT structure maintains a unique representation of all k-cliques, but its size is much smaller than the number of k-cliques. PIVOTER was shown to be much faster than previous k-clique enumeration based algorithms [12], [13], [14], [15], [16].

• Xiaowei Ye, Rong-Hua Li, Qiangqiang Dai and Guoren Wang are with the Beijing Institute of Technology, Beijing 100811, China. E-mail: [email protected], [email protected], [email protected], [email protected].
• Hongzhi Chen is with ByteDance, Beijing 100811, China. E-mail: [email protected].

Although PIVOTER is often very efficient for handling real-life sparse graphs, it may still have a very deep recursion tree when processing the dense regions of the graph, which is the main bottleneck of the PIVOTER algorithm. Moreover, PIVOTER is based on the idea of enumeration of large cliques (not necessarily maximal cliques) to count the k-cliques. It is often not very fast on the dense regions of the

Authorized licensed use limited to: University of Sydney. Downloaded on September 17,2023 at 23:38:39 UTC from IEEE Xplore. Restrictions apply.
© 2023 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See https://www.ieee.org/publications/rights/index.html for more information.

graph, because the dense regions of the graph may contain many large cliques (with complicated overlap relationships), resulting in a large search tree of PIVOTER (e.g., see the results on the LiveJournal dataset in [17]).

Approximation solutions based on sampling are typically able to handle large dense graphs when k is not very large [18], [20], [23]. However, to achieve a desired accuracy, previous sampling-based solutions either require a huge number of samples [18], [24], [25] or consume very high storage space [19], [20], [23], [26], [27] for a relatively large k (e.g., k ≥ 8). Among them, a notable sampling-based approximation algorithm is the TuranShadow algorithm which was proposed by Jain and Seshadhri [20]. As shown in [20], TuranShadow is much faster and more accurate than the other previous sampling-based algorithms. The main limitation of TuranShadow is that it needs to take $O(n\alpha^{k-1} + m)$ time and $O(n\alpha^{k-2})$ space to construct a data structure called Turán Shadow for sampling, where α denotes the arboricity of the graph [12]. Therefore, on large graphs, TuranShadow is very costly for a large k. To reduce the space usage of TuranShadow, the same authors developed an improved TuranShadow algorithm called PEANUTS. PEANUTS adopts an online sampling solution which does not construct the Turán Shadow offline. However, PEANUTS still needs to build a partial Turán Shadow when estimating the k-clique counts of a sampled node, which sometimes consumes a lot of space.

To overcome the limitations of the state-of-the-art algorithms, we propose a new framework to estimate the number of k-cliques in a graph which integrates both the exact PIVOTER algorithm and three newly-developed sampling-based techniques. Our framework is based on a simple but effective observation: PIVOTER is extremely efficient to compute the number of k-cliques in the sparse regions of the graph, while sampling-based solutions are often very efficient and accurate to estimate the k-clique counts in the dense regions of a graph. Based on this crucial observation, we can first partition the graph into sparse and dense regions. Then, for the sparse regions, we invoke PIVOTER to exactly compute the k-clique counts. For the dense regions, we propose three novel sampling techniques based on a concept of graph coloring [28] to estimate the k-clique counts. Specifically, we first present a new concept called k-color set which denotes a set of k nodes with k different colors. Then, we propose a dynamic programming (DP) based k-color set sampling algorithm to estimate the k-clique counts. Since a k-color set is typically a good approximation for a k-clique in the dense regions of a graph, our algorithm is extremely efficient and accurate. In addition, we also propose a novel DP-based k-color path sampling technique and a novel DP-based k-triangle path sampling technique to further improve the efficiency and accuracy. Here a k-color path is a connected k-color set and a k-triangle path is a k-color path with any three consecutive nodes forming a triangle. These two new concepts are more effective to approximate a k-clique than the k-color set. Moreover, unlike TuranShadow and PEANUTS, all of our sampling-based solutions take near-linear space w.r.t. the graph size.

Contributions. In summary, the main contributions of this paper are as follows.

• We propose a new algorithmic framework for estimating k-clique counts which can circumvent the defects of the existing exact and approximation algorithms. We show that our framework is extremely efficient and accurate. It can achieve a $10^{-5}$ relative error by sampling a reasonable number of samples.
• We develop three novel DP-based k-color set sampling techniques to estimate the number of k-cliques in the dense regions of the graph. Our novelty is in the algorithmic use of the classic graph coloring technique for sampling. The striking features of our techniques are that they are not only very efficient and accurate, but also use near-linear space w.r.t. the graph size.
• We evaluate our algorithms on 8 large real-life graphs. The results show that (1) our best algorithm is at least one order of magnitude faster than the state-of-the-art approximate algorithm (PEANUTS) to achieve a 0.1% relative error, using much smaller space; and (2) it can be up to three orders of magnitude faster than the state-of-the-art exact algorithm (PIVOTER) on large graphs. For example, on the hardest dataset LiveJournal with k = 8, TuranShadow takes more than 120 seconds and PIVOTER cannot terminate within 5 hours, while our best algorithm consumes around 20 seconds to achieve a 0.1% relative error. Moreover, our algorithms also exhibit an excellent parallel performance which can achieve 12× ∼ 14× speedup ratios when using 16 threads in our experiments.

Reproducibility. For reproducibility purposes, the source code of this paper is released at https://github.com/LightWant/dpcolor.

Organization. The rest of this paper is organized as follows. In Section 2, we describe several key notations, formulate the problem, summarize several representative existing algorithms of k-clique counting, and also analyze the defects of these algorithms. In Section 3, we propose a novel sampling framework for k-clique counting. In Section 4, we present the DP-based k-color set sampling algorithm. The k-color path and k-triangle path algorithms are developed in Section 5 and Section 6 respectively. Extensive experiments are shown in Section 8. Finally, we survey the related work in Section 9 and conclude this work in Section 10.

2 PRELIMINARIES

Let G = (V, E) be an undirected graph, where V and E denote the sets of nodes and edges respectively. Let n and m be the number of nodes and edges of G respectively. Denote by $N_v(G)$ the set of neighbors of v in G. The degree of v, denoted by $d_v(G)$, is the size of the neighbor set of v, i.e., $d_v(G) = |N_v(G)|$. Given a subset S of V, we denote by $G(S) = (V_S, E_S)$ the subgraph of G induced by S, where $E_S = \{(u, v) \in E \mid u, v \in S\}$. A k-clique is a complete subgraph of G with k nodes, in which every pair of nodes is connected by an edge.

Given a graph G and an integer k, the k-clique counting problem is to compute the number of k-cliques in G. Practical algorithms for solving the k-clique counting problem
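To make the notation of the preliminaries concrete, the following minimal Python sketch stores a graph as a dictionary of neighbor sets, so that `adj[v]` plays the role of $N_v(G)$, and counts k-cliques by checking every k-subset of nodes. The toy graph, the function names and the adjacency representation are illustrative assumptions and do not come from the paper. The brute-force enumeration is only practical for tiny graphs; it is meant as a reference point for the definitions, not as one of the algorithms discussed here.

```python
from itertools import combinations

# Hypothetical toy graph: a 4-clique {0, 1, 2, 3} plus a pendant edge (3, 4).
adj = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}


def degree(adj, v):
    """d_v(G) = |N_v(G)|."""
    return len(adj[v])


def is_clique(adj, nodes):
    """True if every pair of the given nodes is connected by an edge."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))


def count_k_cliques_bruteforce(adj, k):
    """Count k-cliques by testing every k-subset of nodes (exponential in k)."""
    return sum(1 for nodes in combinations(sorted(adj), k) if is_clique(adj, nodes))
```

On the toy graph, the 2-cliques are exactly the 7 edges, and the single 4-clique contributes all 4 triangles.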


are often based on some ordering-based heuristic techniques [15], [16], [17], [20].

Let $\pi : V \to \{v_1, ..., v_n\}$ be a total order of the nodes in G. For two nodes u and v of G, we say that $\pi(v) < \pi(u)$ if u comes after v in the ordering of π. Then, based on such an ordering, we can obtain a DAG (directed acyclic graph) $\vec{G}$ by orienting the edges of the undirected graph G. Specifically, for each undirected edge (u, v) in G, we obtain a directed edge (u, v) in $\vec{G}$ if $\pi(u) < \pi(v)$, otherwise we get a directed edge (v, u). The k-clique counting problem in G is equivalent to computing the number of k-cliques in $\vec{G}$. Existing k-clique counting algorithms that work on the DAG $\vec{G}$ (instead of the original graph G) can guarantee that each k-clique is only explored once, thus significantly improving the efficiency.

Note that many different ordering heuristics for k-clique counting have been developed in the literature [16]. Among them, a widely-used ordering heuristic is the degeneracy ordering [21], where the degeneracy is a metric to measure the sparsity of a graph [29]. Specifically, the degeneracy ordering of nodes in G is defined as an ordering $\{v_1, ..., v_n\}$ such that the degree of $v_i$ is minimum in the subgraph of G induced by $\{v_i, ..., v_n\}$ for each $v_i$ in G. We can make use of a classic peeling algorithm to generate the degeneracy ordering in O(m + n) time [30]. Let δ be the degeneracy of G. Then, we can easily derive that $d_v(\vec{G}) \le \delta$. Since δ is often very small in real-world graphs [21], [29], the degeneracy ordering based k-clique counting algorithms are often very efficient in practice [16]. In this work, we will also use the degeneracy ordering to design our algorithms.

2.1 Existing algorithms and their limitations

The PIVOTER algorithm. PIVOTER is the state-of-the-art exact k-clique counting algorithm which was proposed by Jain and Seshadhri [17]. The PIVOTER algorithm is based on a classic pivoting technique which has been widely used for pruning the search branches in maximal clique enumeration [22]. The key idea of PIVOTER is that it implicitly builds a succinct clique tree (SCT) by using the pivoting technique in the search procedure. Such an SCT structure maintains a unique representation of all k-cliques, but its size is often much smaller than the number of k-cliques. PIVOTER was shown to be much faster than the traditional k-clique listing based algorithms [15], [16], [17]. Since we will make use of PIVOTER as a subroutine in our algorithms, we give the detailed description of PIVOTER in Algorithm 1.

Algorithm 1: The PIVOTER Algorithm [17]
Input: A graph G = (V, E) and an integer k
Output: The number of k-cliques in G
1  $\vec{G}$ ← the DAG generated by the degeneracy ordering of G;
2  ans ← 0;
3  for u ∈ V do
4      PIVOTER($N_u(\vec{G})$, k − 1, 0, 0);
5  return ans;
6  Procedure PIVOTER(S, k, p, h)
7      if h > k return;
8      if S = ∅ then
9          ans ← ans + $\binom{p}{k-h}$;
10         return;
11     pv ← $\arg\max_{u \in S} |N_u(\vec{G}) \cap S|$;
12     PIVOTER($N_{pv}(\vec{G}) \cap S$, k, p + 1, h);
13     U ← S − $N_{pv}(\vec{G})$ − {pv};
14     for $v_i$ ∈ U do
15         PIVOTER($N_{v_i}(\vec{G}) \cap S$, k, p, h + 1);
16         S ← S − {$v_i$};

Algorithm 1 first computes a DAG $\vec{G}$ of G based on the degeneracy ordering (line 1). Then, for each node u ∈ V, the algorithm invokes the PIVOTER procedure to calculate the number of (k − 1)-cliques in $N_u(\vec{G})$ (lines 3-4). In the PIVOTER procedure, it first selects a node with the maximum number of neighbors in S as a pivot node pv (line 11). The candidate set S is then divided into three subsets: {pv}, $N_{pv}(\vec{G}) \cap S$ and $S - \{pv\} - N_{pv}(\vec{G})$. By these three subsets, the cliques can be classified into three types: (1) the k-cliques containing nodes in both {pv} and $N_{pv}(\vec{G}) \cap S$, (2) the k-cliques only containing nodes in $N_{pv}(\vec{G}) \cap S$, and (3) the k-cliques containing nodes in $S - \{pv\} - N_{pv}(\vec{G})$. Then, PIVOTER recursively computes the total numbers for these three types of k-cliques (lines 12-16). Note that the first two types of k-cliques can be counted by invoking PIVOTER with the input set $N_{pv}(\vec{G}) \cap S$ (line 12), whereas the last type of k-cliques are iteratively counted for each node in $S - \{pv\} - N_{pv}(\vec{G})$ (lines 14-16). The worst-case time complexity of PIVOTER is $O(n\alpha 3^{\alpha/3})$ where α is the arboricity [12] of the graph and δ/2 ≤ α ≤ δ. Since α is often very small in real-life sparse graphs, the PIVOTER algorithm was shown to be very efficient in practice [17].

The TuranShadow algorithm and its variant. TuranShadow is a representative sampling-based approximation algorithm which was also proposed by Jain and Seshadhri [20]. As shown in [20], TuranShadow is much faster and more accurate than the other sampling-based algorithms. The TuranShadow algorithm first constructs a data structure, called Turán Shadow, based on the classic Turán's theorem which states that a graph must contain a k-clique if the edge density $\rho = m / \binom{n}{2}$ satisfies $\rho > 1 - 1/(k-1)$. Specifically, the Turán Shadow, denoted by $\mathcal{S}$, contains a set of pairs (S, l) where S is a node set and l ≤ k is an integer. Let $G_S$ be the subgraph induced by the node set S. For each pair (S, l), the edge density of $G_S$ is larger than $1 - 1/(l-1)$, thus $G_S$ must contain an l-clique by Turán's theorem. Jain and Seshadhri [20] showed that there is a one-to-one mapping between a k-clique in G and an l-clique in $G_S$ for a pair (S, l) in $\mathcal{S}$. Therefore, to count the number of k-cliques, it is sufficient to calculate the number of l-cliques in $G_S$ for each pair (S, l), which can be efficiently estimated by a weighted sampling procedure [20]. In [20], Jain and Seshadhri also developed an algorithm with $O(\alpha|\mathcal{S}| + m)$ time complexity to construct the Turán Shadow, where α is the arboricity of the graph and $|\mathcal{S}| = O(n\alpha^{k-2})$. Since α is typically very small in real-life graphs, TuranShadow is efficient to estimate the k-clique counts. Recently, Jain and Seshadhri proposed an improved Turán Shadow algorithm, namely PEANUTS [27], which can be considered as the state-of-the-art sampling-based approximation algorithm. PEANUTS does not construct the Turán Shadow offline. Instead, it
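The pivoting recursion of the PIVOTER procedure (lines 6-16 of Algorithm 1) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it runs directly on an undirected graph given as a dictionary of neighbor sets and starts from the whole node set, so it omits the degeneracy-ordered outer loop of lines 1-4. The names `pivoter_count`, `recurse`, `pivots` and `held` are hypothetical; `pivots` and `held` correspond to the counters p and h.

```python
from math import comb

# Hypothetical toy graph: a 4-clique {0, 1, 2, 3} plus a pendant edge (3, 4).
adj = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3},
}


def pivoter_count(adj, k):
    """Count k-cliques with a pivoting recursion (simplified sketch)."""
    total = 0

    def recurse(cand, pivots, held):
        nonlocal total
        if held > k:
            return
        if not cand:
            # A leaf stands for every clique made of the held nodes plus any
            # subset of the pivot nodes; choose the k - held missing nodes.
            total += comb(pivots, k - held)
            return
        # Pivot: the candidate with the most neighbors inside cand.
        pv = max(cand, key=lambda u: len(adj[u] & cand))
        # Branch 1: the pivot becomes optional, continue in its neighborhood.
        recurse(adj[pv] & cand, pivots + 1, held)
        # Branch 2: every non-neighbor of the pivot must be held in turn.
        remaining = set(cand)
        for v in cand - adj[pv] - {pv}:
            recurse(adj[v] & remaining, pivots, held + 1)
            remaining.discard(v)

    recurse(set(adj), 0, 0)
    return total
```

On the toy graph the recursion reaches only two leaves, one with four pivot nodes and one with a single held node and a single pivot node, yet it accounts for all 7 edges, 4 triangles and the 4-clique. This illustrates why the clique tree can be much smaller than the number of k-cliques.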


builds a partial Turán Shadow for a sampled node during the sampling procedure, thus it uses much less space than the original TuranShadow algorithm. Moreover, PEANUTS is often much faster than TuranShadow, since there is no need to construct the whole Turán Shadow, which takes much time in the original TuranShadow algorithm.

Limitations of the state-of-the-art algorithms. Although the PIVOTER algorithm is often very efficient for handling real-life sparse graphs (because real-life graphs often have a small arboricity), it is still intractable when processing some hard instances, such as the LiveJournal graph in [17]. The reason may be that such hard instances often have a huge number of maximal cliques, thus the succinct clique tree (SCT) of the PIVOTER algorithm can be very large, rendering the algorithm intractable. TuranShadow is generally faster than the exact PIVOTER algorithm for handling dense graphs with a provably small relative error. However, the main limitations of TuranShadow are twofold: (1) it uses $O(n\alpha^{k-2})$ space to store the Turán Shadow which is very costly for large graphs; and (2) it often needs to take much time to construct the Turán Shadow for large graphs (the construction time is $O(n\alpha^{k-1} + m)$). These two limitations are alleviated by the improved Turán Shadow algorithm PEANUTS. However, on some large graphs, PEANUTS needs to take much time to construct the partial Turán Shadow and uses considerable space, thus it is still not very efficient when processing large graphs.

3 THE PROPOSED FRAMEWORK

In this section, we propose a new algorithmic framework to estimate the number of k-cliques which combines both the exact PIVOTER algorithm and the sampling-based algorithms. The key idea of our framework is based on a simple but effective observation. The PIVOTER algorithm often works very efficiently in the sparse regions of the graph, in which the number of k-cliques is typically not very large. However, in the dense regions of the graph, PIVOTER may be very costly to compute the k-clique counts, as the dense regions of the graph may contain a huge number of k-cliques. On the contrary, the sampling-based solutions are often very efficient and accurate to estimate the number of k-cliques in the dense regions of the graph, but they generally perform very badly in the sparse regions of the graph. This is because the k-cliques are relatively easy to sample in the dense regions, but they are often very hard to draw from the sparse regions of the graph. Therefore, to overcome the limitations of both the exact and sampling algorithms, we can apply the exact PIVOTER algorithm to calculate the k-clique counts in the sparse regions of the graph, and use the sampling-based techniques to estimate the number of k-cliques in the remaining dense regions of the graph. The details of our framework are shown in Algorithm 2.

Algorithm 2: The Proposed Framework
Input: A graph G = (V, E), an integer k, and the sample size t
Output: The number of k-cliques in G
1  $\vec{G}$ ← the DAG generated by the degeneracy ordering of G;
2  ans ← 0; S ← ∅;
3  foreach v ∈ V do
4      if $\bar{d}(G(N_v(\vec{G}))) < k$ then ans ← ans + PIVOTER($N_v(\vec{G})$, k − 1);
5      else S ← S ∪ {v};
6  return ans + Sampling($\vec{G}$, S, k, t);

Note that in Algorithm 2, we make use of the average degree of the nodes in the subgraph $C = (V_C, E_C)$ of G, denoted by $\bar{d}(V_C) = \sum_{v \in V_C} d_v(C) / |V_C|$, as an indicator to measure the sparsity of C. We refer to a subgraph C of G as a dense subgraph of G if $\bar{d}(V_C) \ge k$ (i.e., it lies in the dense regions of G), otherwise it is called a sparse subgraph. Algorithm 2 first computes a DAG $\vec{G}$ by the degeneracy ordering of G (line 1). Let $N_v(\vec{G})$ be the out-neighbors of a node v in $\vec{G}$, and $G(N_v(\vec{G}))$ be the subgraph induced by $N_v(\vec{G})$ in G. If the average degree of $G(N_v(\vec{G}))$ is smaller than k, the algorithm invokes PIVOTER to exactly compute the number of (k − 1)-cliques contained in $N_v(\vec{G})$ (line 4). Otherwise, the subgraph $G(N_v(\vec{G}))$ is considered as a dense region of G, and the (k − 1)-cliques contained in $N_v(\vec{G})$ are estimated by a sampling algorithm (lines 5-6).

Let α be the arboricity [12] of the input graph G = (V, E) and V′ be the set of nodes in the sparse region of G. Then, we have the following result.

Theorem 1. The time complexity of Line 4 of Algorithm 2 is $O\left(|V'| \, \alpha \left\lceil \sqrt{k\alpha + \frac{1}{2}} \right\rceil 3^{\lceil \sqrt{k\alpha + \frac{1}{2}} \rceil / 3}\right)$.

Proof. For a node v in V′, we use $\alpha_v$ to denote the arboricity of $G(N_v(\vec{G}))$, $m_v$ to denote the count of edges in $G(N_v(\vec{G}))$, and $n_v$ to denote the count of nodes $|N_v(\vec{G})|$. It is easy to derive that $n_v \le \delta \le 2\alpha$ according to the degeneracy ordering. By $\bar{d}(G(N_v(\vec{G}))) < k$ (Line 4 of Algorithm 2), we can derive that $m_v = |N_v(\vec{G})| \times \bar{d}(G(N_v(\vec{G}))) < k|N_v(\vec{G})| \le 2k\alpha$. Then, we have $\alpha_v \le \left\lceil \frac{\sqrt{2m_v + n_v}}{2} \right\rceil \le \left\lceil \sqrt{k\alpha + \frac{1}{2}} \right\rceil$ [12]. Thus, the total time complexity is $O\left(\sum_{v \in V'} |N_v(\vec{G})| \, \alpha_v 3^{\alpha_v / 3}\right)$, because the time complexity of PIVOTER is $O(n\alpha 3^{\alpha/3})$ for a graph with n nodes and arboricity α [17]. As a result, we can derive that the time complexity of Line 4 of Algorithm 2 is $O\left(|V'| \, \alpha \left\lceil \sqrt{k\alpha + \frac{1}{2}} \right\rceil 3^{\lceil \sqrt{k\alpha + \frac{1}{2}} \rceil / 3}\right)$.

Note that Theorem 1 shows the time complexity of Algorithm 2 in the sparse region of the graph. Since $\left\lceil \sqrt{k\alpha + \frac{1}{2}} \right\rceil$ is smaller than α (because k is usually very small), our framework is efficient on the sparse region of the graphs.

The remaining question is how we can devise an efficient and effective sampling algorithm to estimate the number of k-cliques in the dense regions of G. Traditional edge sampling algorithms, such as [18], [31], are often inefficient, because those algorithms require a considerable number of samples to achieve a desired accuracy [20]. The color-coding based techniques often consume a significant amount of space [19], [23], [26] and also they are less efficient than the TuranShadow algorithm [20]. The TuranShadow algorithm and its variant [20], [27], which are the state-of-the-art sampling-based techniques, also need much space to store the (partial) Turán Shadow. Moreover, the construction
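As a rough illustration of the sparse/dense test in lines 3-5 of Algorithm 2, the Python sketch below computes a degeneracy ordering by repeatedly peeling a minimum-degree node, derives the out-neighborhood of each node, and compares the average degree of the induced subgraph with k. It is a sketch under stated assumptions rather than the paper's implementation: the peeling loop is a simple quadratic-time version (not the O(m + n) bucket-based algorithm of [30]), ties are broken by node label, and the names `degeneracy_order` and `split_sparse_dense` are hypothetical.

```python
def degeneracy_order(adj):
    """Peel a minimum-degree node repeatedly (simple O(n^2) version)."""
    deg = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    order = []
    while remaining:
        # Ties are broken by node label to keep the output deterministic.
        v = min(remaining, key=lambda x: (deg[x], x))
        order.append(v)
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                deg[u] -= 1
    return order


def split_sparse_dense(adj, k):
    """Classify each node by the average degree of its out-neighborhood."""
    rank = {v: i for i, v in enumerate(degeneracy_order(adj))}
    sparse, dense = [], []
    for v in adj:
        # Out-neighbors of v in the DAG: neighbors that come later in the order.
        out = {u for u in adj[v] if rank[u] > rank[v]}
        if out:
            avg_deg = sum(len(adj[u] & out) for u in out) / len(out)
        else:
            avg_deg = 0.0
        (sparse if avg_deg < k else dense).append(v)
    return sparse, dense


# Hypothetical example: the complete graph on 6 nodes.
adj = {v: set(range(6)) - {v} for v in range(6)}
```

With k = 3 on this complete graph, only the first two nodes of the ordering have out-neighborhoods with average degree at least 3, so they would be handed to the sampling step, while the remaining nodes would be processed exactly.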


time of the (partial) Turán Shadow is often very long for large graphs, because the worst-case time complexity of TuranShadow is exponential. In Sections 4, 5 and 6, we will propose three novel and efficient sampling algorithms to tackle this problem.

Parallel implementation. Note that the proposed framework (Algorithm 2) can be easily parallelized, because the number of k-cliques in the subgraph induced by the out-neighbors of each node in $\vec{G}$ is independent. Specifically, in lines 3-5 of Algorithm 2, we can process the nodes in the sparse regions in parallel by independently invoking the PIVOTER algorithm. In the dense regions, the sampling-based techniques are also easy to parallelize, because we can always draw t independent samples in parallel. In our experiments, we will show that our parallel implementations can achieve a near-linear speedup ratio on real-life graphs.

[Figure 1: (a) An example graph; (b) The DP table of the k-color sets; (c) All 3-color paths; (d) Three 4-color paths and two 4-triangle paths.]
Fig. 1. Illustration of the three proposed color-based sampling techniques

4 k-COLOR SET SAMPLING

In this section, we develop a novel sampling approach to estimate the k-clique counts in the dense regions of the graph, called k-color set sampling. Our technique is based on a concept of graph coloring [28], [32], [33]. Specifically, we first color the nodes in a graph such that each pair of adjacent nodes are colored with different colors. Let χ be the number of colors that are used to color all nodes in the graph G. The graph coloring procedure assigns an integer color value taken from [1, ..., χ] to each node in G, and no two adjacent nodes have the same color value. Note that since the minimum coloring problem (χ is minimum) is NP-hard [28], we use a linear-time greedy coloring algorithm [32], [33] to obtain a feasible coloring solution. Based on a feasible coloring solution, we define a concept called k-color set as follows.

Definition 1. A set of nodes $V_k$ in the colored graph G is called a k-color set if it contains k nodes with k different colors.

Note that by Definition 1, the nodes of any k-clique must form a k-color set. In particular, we have the following lemma.

Lemma 1. Given a graph G, all k-cliques must be contained in the set of all k-color sets.

Let $cnt_k(G, clique)$ and $cnt_k(G, color)$ be the number of k-cliques and k-color sets of G respectively. Denote by $\rho_c$ the k-clique density of a graph G, which is defined as the ratio of the number of k-cliques to the number of k-color sets of G, i.e., $\rho_c = \frac{cnt_k(G, clique)}{cnt_k(G, color)}$. Intuitively, in the dense regions of the graph G, a k-color set is likely to be a k-clique. Therefore, the k-clique density $\rho_c$ of the dense region of G is often not very small. As a consequence, an effective sampling technique to estimate the number of k-cliques can be obtained by estimating $\rho_c$.

There are two nontrivial problems that need to be tackled to develop such a sampling technique. First, we need to devise an efficient algorithm to compute the number of k-color sets. Second, to estimate $\rho_c$, we need to develop a uniform sampling mechanism to sample the k-color sets. Below, we will propose a dynamic programming algorithm to solve these issues.

4.1 DP-based k-color set sampling

Here we first propose a DP algorithm to compute the number of k-color sets. Then, we show how to use the DP algorithm to uniformly sample a k-color set.

Counting the number of k-color sets. Let χ be the number of colors of the graph G obtained by the greedy coloring algorithm [32], [33]. Denote by $a_i$ the number of nodes in G with the color i ∈ [1, χ]. Let $G_i$ be the subgraph of G that only contains the nodes of G with color values no larger than i, i.e., $G_i = (V_i, E_i)$, where $V_i = \{v \in V \mid c(v) \le i\}$, $E_i = \{(u, v) \in E \mid u, v \in V_i\}$, and c(v) is the color value of v in G. Let F(i, j) be the number of j-color sets in $G_i$. Then, we have the following recursive function for all i, j ∈ [1, χ].

$F(i, j) = a_i \times F(i-1, j-1) + F(i-1, j)$.    (1)

The key idea of Eq. (1) is that the number of j-color sets in $G_i$ can be derived by considering two cases: (1) the color i is included in the j-color sets; and (2) the color i is not included in the j-color sets. For the first case, the number of j-color sets in $G_i$ is equal to $a_i$ times the number of (j − 1)-color sets in $G_{i-1}$, i.e., $a_i \times F(i-1, j-1)$. For the second case, the number of j-color sets is equal to the number of j-color sets in $G_{i-1}$, which is F(i − 1, j). Thus, the total number of j-color sets in $G_i$ is the sum over these two cases. Clearly, the number of k-color sets in G is equal to the number of k-color sets in $G_\chi$, i.e., F(χ, k). In addition, the initial states of F(i, j) are as follows:

$F(i, 0) = 1$, for all i ∈ [0, χ],
$F(i, j) = 0$, for all i ∈ [0, χ], j ∈ [i + 1, χ].    (2)

Based on Eqs. (1) and (2), we can compute the number of k-color sets F(χ, k) in O(kχ) time by dynamic program-
ratio between the number of k -cliques and the number of ming. The detailed implementation of the DP algorithm can


be found in the DPCount procedure of Algorithm 3 (see lines 5-12).

Algorithm 3: DPSampler(G, χ, k)
Input: A colored graph G = (V, E), an integer k, and the maximum color number χ
Output: A uniformly sampled k-color set
1 F ← DPCount(G, χ, k);
2 p(i,j) ← a_i × F(i − 1, j − 1)/F(i, j) for all i ∈ [1, χ] and j ∈ [1, k];
3 R ← DPSampling(G, P, ∅, χ, k);
4 return R;
5 Procedure DPCount(G, χ, k)
6     Let a_i be the number of nodes with color i in G;
7     F(i, j) ← 0 for all i ∈ [0, χ] and j ∈ [i + 1, k];
8     foreach i = 0 to χ do F(i, 0) = 1;
9     foreach i = 1 to χ do
10        for j = 1 to k do
11            F(i, j) = a_i × F(i − 1, j − 1) + F(i − 1, j);
12    return F;
13 Procedure DPSampling(G, P, R, i, j)
14    if j = 0 then return R;
15    Sample the color i with probability p(i,j);
16    if the color i is sampled then
17        Randomly choose a node v in G with color i;
18        return DPSampling(G, P, R ∪ {v}, i − 1, j − 1);
19    else return DPSampling(G, P, R, i − 1, j);

From counting to uniformly sampling. Here we propose an efficient approach to uniformly sample a k-color set based on the k-color set counting technique. For convenience, we refer to a set of k different colors selected from [1, χ] as a k-color class. Clearly, in a graph G, a k-color class may contain a set of k-color sets.

To generate a uniform k-color set, a potential method is to first sample a k-color class, and then randomly select a node in G with color i for each i in the sampled k-color class. The challenge of this method is how to sample the k-color class so that the resulting k-color set is uniformly generated. Obviously, the straightforward method that uniformly picks k different colors from [1, χ] is incorrect in our case. This is because the numbers of k-color sets contained in various k-color classes are different. Thus, uniformly sampling a k-color class from [1, χ] will introduce biases for generating a uniform k-color set.

To overcome this challenge, we propose a DP algorithm to sample a k-color class which can guarantee that the resulting k-color set is uniformly drawn. In particular, a j-color class in $G_i$ either (1) contains the color i, or (2) does not contain the color i. If the first case is true, the other j − 1 colors of the j-color class are selected from [1, i − 1] in $G_{i-1}$. However, for the second case, the whole j-color class must be selected from [1, i − 1] in $G_{i-1}$. Therefore, we can sample a k-color class in G based on a similar DP equation as shown in Eq. (1). More specifically, to sample a j-color class, we define the probability of selecting the color i in $G_i$ as

\[ p_{(i,j)} = \frac{a_i \times F(i-1, j-1)}{F(i, j)}. \tag{3} \]

Clearly, the probability of not choosing the color i in $G_i$ is $1 - p_{(i,j)} = F(i-1, j)/F(i, j)$. Based on Eq. (3), we can sample a j-color class using the following recursive sampling procedure. In each recursion, we pick the color i in $G_i$ with probability $p_{(i,j)}$. If the color i is sampled, we recursively sample a (j − 1)-color class in $G_{i-1}$. Otherwise, we recursively sample a j-color class in $G_{i-1}$. After obtaining a k-color class, a k-color set is generated by randomly selecting a node with each color i in the k-color class. The detailed implementation of our algorithm for uniformly sampling a k-color set is shown in Algorithm 3.

Algorithm 3 first invokes the DP procedure to compute F(i, j) for every i ∈ [1, χ] and j ∈ [1, k] (line 1 and lines 5-12). Then, the algorithm computes the probability $p_{(i,j)}$ based on Eq. (3) (line 2). After that, the algorithm calls the recursive sampling procedure to uniformly generate a k-color set (line 3 and lines 13-19). The following results ensure the correctness of Algorithm 3.

Lemma 2. The DPSampling procedure in Algorithm 3 outputs a k-color set of G if χ ≥ k.

Proof. On the one hand, it is easy to verify that at most k colors are output by the DPSampling procedure, since $p_{(i,0)} = 0$ and DPSampling terminates immediately when j = 0. On the other hand, by Eq. (3), we can derive that $p_{(i,i)} = 1$. This is because F(i − 1, i) = 0 by definition, thus $1 - p_{(i,i)} = F(i-1, i)/F(i, i) = 0$. As a result, the probability of sampling the color i with $p_{(i,i)}$ is always 1, thus at least k colors are sampled by DPSampling if χ ≥ k. Putting it all together, the lemma is established.

Theorem 2. Algorithm 3 outputs a uniform k-color set.

Proof. Let X be the event of a random k-color class of G sampled by DPSampling. For each color j from 1 to χ, let $Y_j$ be an indicator random variable, which is equal to 1 if the color j is selected in the event X, and 0 otherwise. Let Pr(X) be the occurrence probability of the event X. Then, we have the following equation:

\[ \Pr(X) = \Pr\Big(\Big(\sum_{i=1}^{\chi} Y_i\Big) = k\Big). \tag{4} \]

Recall that DPSampling draws k colors following the decreasing order of the color values (i.e., from χ to 1). For each color i ∈ [1, χ], the probability of selecting the color i in $G_i$ when j colors remain to be chosen is $p_{(i,j)}$. Assume that the sampled k-color class of G is $C = \{c_1, ..., c_k\}$, where each $c_i$ is a color value of G and $c_1 > c_2 > \cdots > c_k$. Clearly, a k-color class C partitions the interval [1, χ] into at most 2k + 1 sub-intervals as $\{[c_1+1, \chi], [c_1, c_1], [c_2+1, c_1-1], \cdots, [c_k+1, c_{k-1}-1], [c_k, c_k], [1, c_k-1]\}$. Note that DPSampling only selects a color in the sub-intervals $[c_i, c_i]$ for every i = 1, ..., k, and no color is selected in the other sub-intervals. Therefore, the




probability of $\Pr((\sum_{i=1}^{\chi} Y_i) = k)$ can be computed by

\[
\begin{aligned}
&\frac{F(\chi-1,k)}{F(\chi,k)} \times \frac{F(\chi-2,k)}{F(\chi-1,k)} \times \cdots \times \frac{F(c_1,k)}{F(c_1+1,k)} \\
&\times \frac{a_{c_1} \times F(c_1-1,k-1)}{F(c_1,k)} \times \frac{F(c_1-2,k-1)}{F(c_1-1,k-1)} \times \cdots \\
&\times \frac{F(c_2,k-1)}{F(c_2+1,k-1)} \times \frac{a_{c_2} \times F(c_2-1,k-2)}{F(c_2,k-1)} \\
&\times \cdots \times \frac{F(c_k,1)}{F(c_k+1,1)} \times \frac{a_{c_k} \times F(c_k-1,0)}{F(c_k,1)} \\
&= \frac{a_{c_1} \times a_{c_2} \times \cdots \times a_{c_k}}{F(\chi,k)}.
\end{aligned} \tag{5}
\]

After obtaining a k-color class C, the algorithm further samples k nodes with k different colors in C from G. Let Pr(k-color set) be the probability of sampling a k-color set from G. Then, we have

\[
\begin{aligned}
\Pr(k\text{-color set}) &= \Pr(k \text{ nodes with different colors} \mid X) \times \Pr(X) \\
&= \frac{1}{a_{c_1} \times a_{c_2} \times \cdots \times a_{c_k}} \times \frac{a_{c_1} \times a_{c_2} \times \cdots \times a_{c_k}}{F(\chi,k)} \\
&= \frac{1}{F(\chi,k)} = \frac{1}{cnt_k(G, color)}.
\end{aligned} \tag{6}
\]

By Eq. (6), each k-color set is uniformly sampled, thus the theorem is established.

Algorithm 4: Estimating the number of k-cliques by k-color set sampling
Input: A graph $\vec{G}$, a node set S, an integer k, and the sample size t
Output: An estimation of the number of k-cliques
1 Color the graph $\vec{G}$ using a linear-time algorithm [32], [33];
2 Let χ be the number of colors obtained;
3 foreach v ∈ S do
4     F_v ← DPCount(G(N_v($\vec{G}$)), χ, k − 1);
5 cntKCol ← Σ_{v∈S} F_v(χ, k − 1);
6 Set the probability distribution D over the nodes in S where p(v) = F_v(χ, k − 1)/cntKCol for each v ∈ S;
7 successTimes ← 0;
8 for i = 1 to t do
9     Independently sample a node v from D;
10    R ← {v} ∪ DPSampler(G(N_v($\vec{G}$)), χ, k − 1);
11    if R is a k-clique then
12        successTimes ← successTimes + 1;
13 ρ_k ← successTimes/t;
14 return ρ_k × cntKCol;
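The counting recurrence of Eqs. (1)-(2), the sampling rule of Eq. (3), and the estimator of Algorithm 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dp_count`, `dp_sample`, `estimate_kcliques`), the dictionary-based graph representation, and the choice to orient edges by the (color, id) order when forming out-neighborhoods are all hypothetical choices made for this sketch.

```python
import random
from itertools import combinations

def dp_count(a, k):
    """F[i][j] = number of j-color sets using colors 1..i (Eqs. (1)-(2)).
    a[i] is the number of nodes with color i; a[0] is unused padding."""
    chi = len(a) - 1
    F = [[0] * (k + 1) for _ in range(chi + 1)]
    for i in range(chi + 1):
        F[i][0] = 1
    for i in range(1, chi + 1):
        for j in range(1, k + 1):
            F[i][j] = a[i] * F[i - 1][j - 1] + F[i - 1][j]
    return F

def dp_sample(groups, F, k, rng=random):
    """Draw one k-color set uniformly. groups[i] lists the nodes with color i.
    Assumes F[chi][k] > 0, so every visited state has F[i][j] > 0."""
    chosen, j = [], k
    for i in range(len(groups) - 1, 0, -1):
        if j == 0:
            break
        p = len(groups[i]) * F[i - 1][j - 1] / F[i][j]  # Eq. (3)
        if rng.random() < p:
            chosen.append(rng.choice(groups[i]))
            j -= 1
    return chosen

def estimate_kcliques(adj, color, S, k, t, rng=random):
    """Algorithm-4-style estimate. adj maps a node to the set of its neighbors.
    Assumption of this sketch: the out-neighbors of v are its neighbors that
    come after v in the (color, id) order."""
    chi = max(color.values())
    tables = {}
    for v in S:
        groups = [[] for _ in range(chi + 1)]
        for u in adj[v]:
            if (color[u], u) > (color[v], v):
                groups[color[u]].append(u)
        F = dp_count([len(g) for g in groups], k - 1)
        if F[chi][k - 1] > 0:  # keep only nodes that can start a k-color set
            tables[v] = (groups, F)
    nodes = list(tables)
    weights = [tables[v][1][chi][k - 1] for v in nodes]
    cnt_kcol = sum(weights)
    if cnt_kcol == 0:
        return 0.0
    hits = 0
    for _ in range(t):
        v = rng.choices(nodes, weights=weights)[0]
        groups, F = tables[v]
        R = [v] + dp_sample(groups, F, k - 1, rng)
        if all(y in adj[x] for x, y in combinations(R, 2)):
            hits += 1
    return hits / t * cnt_kcol
```

With the color counts of Example 1 (one node of color 1 and two nodes of each of the colors 2, 3, 4), `dp_count([0, 1, 2, 2, 2], 3)` gives F(4, 3) = 20 and F(3, 3) = 4, matching the ratio 1/5 used in that example.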

The following theorem shows the complexity of Algorithm 3.

Theorem 3. Suppose that the graph G is colored and the nodes in each color group are obtained. Then, both the time and space complexity of Algorithm 3 are O(χk).

Proof. Clearly, the time complexity of the DP procedure for counting the number of k-color sets is O(χk). In the DPSampling procedure, we can randomly choose a node with color i in constant time if the color groups are obtained (line 17). The total time cost of the DPSampling procedure is bounded by O(χ + k). As a result, the time complexity of Algorithm 3 is O(χk). For the space complexity, Algorithm 3 only requires O(χk) additional space to store the DP table F and the probabilities p.

Example 1. Fig. 1(a) is a colored graph with χ = 4. The color values of nodes {0, 1, 2, 3, 4, 5, 6} are {1, 2, 2, 3, 4, 3, 4}, respectively. Clearly, we have $a_1 = 1$ and $a_i = 2$ for i = 2, 3, 4. Initially, we have F(i, 0) = 1 for all i ∈ [0, 4], and F(i, j) = 0 for all i ∈ [0, 4], j ∈ [i + 1, 4]. By Eq. (1), we have $F(1, 1) = a_1 \times F(0, 0) + F(0, 1) = a_1 = 1$, which means that there is only one way to choose a vertex with color 1. Similarly, we get $F(2, 1) = a_2 \times F(1, 0) + F(1, 1) = 3$, which means that there are 3 different ways to choose a vertex from the vertices with colors 1 and 2. The DP table is shown in Fig. 1(b). Then, we analyze the probability of sampling the three nodes {0, 2, 3} with colors 1, 2, 3, respectively. Note that the probability of color 4 not being sampled is $1 - p_{(4,3)} = F(3,3)/F(4,3) = 1/5$. Then, the probability of color 3 being sampled is $p_{(3,3)} = a_3 \times F(2,2)/F(3,3) = 1$. Thus, one vertex with color 3 should be chosen and the probability of node 3 being sampled is 1/2. Likewise, the probabilities of nodes 2 and 0 being sampled are 1/2 and 1, respectively. Finally, the probability of {0, 2, 3} being sampled is 1/20.

4.2 Estimating the number of k-cliques

By Theorem 2, we can first make use of Algorithm 3 to uniformly sample k-color sets from G, and then estimate the clique density $\rho_c$ in the k-color sets of G. After that, the number of k-cliques in G can be estimated by $\rho_c \times F(\chi, k)$. Based on this idea, we propose a weighted sampling algorithm to estimate the number of cliques in the dense regions of G. The detailed implementation of our algorithm is shown in Algorithm 4.

Let S be a set of nodes whose neighborhood subgraphs are dense regions of G, i.e., $\bar{d}(G(N_v(\vec{G}))) \ge k$ for each v ∈ S. Algorithm 4 first colors the graph using a linear-time greedy algorithm [32], [33] (line 1). Then, the algorithm invokes the DPCount procedure to compute the number of k-color sets for each v ∈ S (lines 3-4). Let cntKCol be the total number of k-color sets (line 5). Then, we can obtain a probability distribution D over S where $p(v) = F_v(\chi, k-1)/cntKCol$ for each v ∈ S (line 6). After that, Algorithm 4 draws t k-color sets by (1) sampling a node v ∈ S with probability p(v) (line 9), and (2) uniformly sampling a (k − 1)-color set from $G(N_v(\vec{G}))$ (line 10). The algorithm computes the k-clique density $\rho_c$ in the sampled k-color sets (lines 11-13), and then estimates the k-clique count as $\rho_c \times cntKCol$ (line 14). The following theorem shows that Algorithm 4 can obtain an unbiased estimator.

Theorem 4. Algorithm 4 outputs an unbiased estimator for the number of k-cliques in the dense regions of G.

Proof. Let $X_i = 1$ if the i-th sampled k-color set is a k-clique, otherwise $X_i = 0$. Observe that

\[ \Pr(X_i = 1) = \sum_{v \in S} \big[\Pr(\text{choose } v \text{ from } D) \times \Pr(\text{choose a clique from } G(N_v(\vec{G})))\big]. \tag{7} \]


In the summation, the former probability is $\frac{F_v(\chi,k-1)}{\sum_{v \in S} cnt_k(G(N_v(\vec{G})), color)}$, and the latter is exactly $\frac{cnt_k(G(N_v(\vec{G})), clique)}{F_v(\chi,k-1)}$. Consequently, we have $\Pr(X_i = 1) = \frac{\sum_{v \in S} cnt_k(G(N_v(\vec{G})), clique)}{\sum_{v \in S} cnt_k(G(N_v(\vec{G})), color)}$. This implies that the probability of sampling a k-clique is exactly the k-clique density in the dense regions. By the linearity of expectation, we have

\[
\begin{aligned}
E\Big[cntKCol \times \frac{\sum_{i \le t} X_i}{t}\Big]
&= \sum_{v \in S} cnt_k(G(N_v(\vec{G})), color) \times \frac{\sum_{i \le t} E[X_i]}{t} \\
&= \sum_{v \in S} cnt_k(G(N_v(\vec{G})), clique).
\end{aligned} \tag{8}
\]

Therefore, Algorithm 4 returns an unbiased estimator of the k-clique count in the dense regions of G.

By applying the classic Chernoff bound, we can easily derive that Algorithm 4 is able to produce a 1 − ε approximation of the k-clique count in the dense regions of the graph.

Theorem 5. Algorithm 4 returns a 1 − ε approximation of the number of k-cliques in the dense regions of G with probability 1 − 2σ if $t \ge \frac{3}{\rho_c \epsilon^2} \ln \frac{1}{\sigma}$, where ε and σ are small positive values and t is the sample size.

Proof. Denote by $\hat{\rho}_c$ the estimator of the k-clique density (line 13 of Algorithm 4). Since our estimator is unbiased, we have $E[\hat{\rho}_c] = \rho_c$. Then, the expected number of k-cliques in the t samples is $E[\hat{\rho}_c t] = \rho_c t$. Based on the Chernoff bound, we easily obtain the following results:

\[ \Pr(\hat{\rho}_c t \le (1-\epsilon)\rho_c t) \le \exp\Big(-\frac{\epsilon^2 \rho_c t}{2}\Big) \le \exp\Big(-\frac{\epsilon^2 \rho_c t}{3}\Big), \tag{9} \]

\[ \Pr(\hat{\rho}_c t \ge (1+\epsilon)\rho_c t) \le \exp\Big(-\frac{\epsilon^2 \rho_c t}{3}\Big). \tag{10} \]

Further, we have:

\[ \Pr\Big(\frac{|\hat{\rho}_c - \rho_c|}{\rho_c} \ge \epsilon\Big) \le 2\exp\Big(-\frac{\epsilon^2 \rho_c t}{3}\Big). \tag{11} \]

Let $\exp(-\frac{\epsilon^2 \rho_c t}{3}) \le \sigma$. Then, we can derive that $t \ge \frac{3}{\rho_c \epsilon^2} \ln \frac{1}{\sigma}$. This completes the proof.

Note that by Theorem 5, the sample size of our algorithm relies on the k-clique density $\rho_c$. Since $\rho_c$ is often not very small in the dense regions of a graph, Algorithm 4 is expected to be very efficient in practice, which is also confirmed in our experiments. Below, we analyze the time and space complexity of Algorithm 4.

Theorem 6. Algorithm 4 consumes $O((|S| + t)\chi k + k^2 t + m + n)$ time and $O(m + n + \chi k)$ space.

Proof. For the time complexity, Algorithm 4 takes O(m + n) time to obtain a feasible graph coloring. Then, it consumes O(|S|χk) time to compute $F_v$ for each v ∈ S. After that, to draw a k-color set, the algorithm takes O(χk) time, plus $O(k^2)$ time to check whether it is a clique. Thus, the total time used in the k-color set sampling stage is $O(t(\chi k + k^2))$. As a consequence, the time complexity of Algorithm 4 is $O((|S| + t)\chi k + m + n + k^2 t)$. For the space complexity, the algorithm needs to store the graph G and the colors, which takes O(m + n) space in total. Additionally, the algorithm uses O(χk) space to store the DP table when sampling a k-color set. Note that the algorithm does not store all the DP tables for all samples. Thus, the total space overhead of Algorithm 4 is O(m + n + χk).

Remark. The proposed k-color set sampling algorithm is completely different from the traditional color coding technique [19], [23], [26] for k-clique counting. The color coding technique randomly assigns a color to each node (it is actually not a valid graph coloring), so two adjacent nodes may have the same color. However, our k-color set based sampling algorithm is based on the graph coloring technique, which requires two adjacent nodes to have different colors. For the color coding technique, the probability of each k-clique being colored with k different colors is $\frac{k!}{k^k}$ [19]. With the increase of k, such a probability decreases dramatically. However, our technique can ensure that each k-clique of G is a k-color set no matter what k is. Moreover, unlike color coding, the probability of sampling k nodes with k different colors from G (the colored graph) is nonuniform in our algorithm.

5 CONNECTED k-COLOR SET SAMPLING

Recall that to achieve a 1 − ε approximation, the sample size of Algorithm 4 heavily relies on the k-clique density over the k-color sets, i.e., $\rho_c$ (see Theorem 5). Although the dense regions of a graph G often have a relatively high $\rho_c$, it may still be very small in some cases, as the k-color sets do not fully capture the clique property. To improve the effectiveness of the sampling algorithm, we propose a novel technique which can further boost the k-clique density by considering the connectivity of the k-color set.

A k-color set is definitely not a k-clique if the subgraph induced by the k-color set is not connected. Clearly, such disconnected k-color sets are unpromising samples for our sampling algorithm. Therefore, to improve the sampling performance, a natural question is whether we can directly sample the connected k-color sets from G. In this section, we answer this question affirmatively by devising a novel k-color path sampling technique. The insight is that we only sample the k-color sets in which there exists a simple path with length k − 1 in the subgraph induced by the k-color set. For convenience, we refer to such a connected k-color set as a k-color path.

Similar to sampling k-color sets in G, we also need to uniformly sample the k-color paths. Unfortunately, the solutions proposed in Section 4 are no longer applicable for sampling k-color paths. Below, we develop a new DP-based sampling technique to uniformly generate the k-color paths.

5.1 DP-based k-color path sampling

Counting the number of k-color paths. We start by developing an algorithm to count the number of k-color paths in a graph G. We assume that the graph G is colored with the color values selected from [1, χ]. Based on the color values,


we can obtain a color ordering by sorting the nodes in non-decreasing order of their color values. Note that we can use the node IDs to break ties to obtain a total ordering. It is worth mentioning that such a color ordering was used in the k-clique listing algorithms [16]. Clearly, we are able to construct a DAG $\vec{G}$ by the color ordering, where a directed edge $(u, v) \in \vec{G}$ is obtained by orienting the edge (u, v) ∈ G from u to v if v comes after u in the color ordering. Based on the DAG $\vec{G}$, we can obtain the following results.

Theorem 7. Let $\vec{G}$ be the DAG generated by the color ordering. Then, any (k − 1)-path in $\vec{G}$ forms a k-color path.

Proof. Let $P = \{(v_1, v_2), (v_2, v_3), \cdots, (v_{k-1}, v_k)\}$ be a (k − 1)-path in $\vec{G}$. By the color ordering, we have $c(v_i) \le c(v_{i+1})$ for every i ∈ [1, k − 1], where $c(v_i)$ denotes the color value of $v_i$. Since any two adjacent nodes have different colors, we have $c(v_i) \ne c(v_{i+1})$ for each i ∈ [1, k − 1]. As a result, the colors strictly increase along P, and thus the path P is a k-color path.

Theorem 8. Let $\vec{G}$ be the DAG generated by the color ordering. Then, any k-clique $C = \{v_1, v_2, \cdots, v_k\}$ in G is a k-color path in $\vec{G}$.

Proof. Let $C = \{v_1, v_2, \cdots, v_k\}$ be a k-clique in G. Clearly, the nodes in C have different colors. Suppose without loss of generality that $c(v_1) < c(v_2) < \cdots < c(v_k)$. Since $\vec{G}$ is generated by the color ordering, there must exist a path $\{(v_1, v_2), \cdots, (v_{k-1}, v_k)\}$ in $\vec{G}$ which also forms a valid k-color path.

Algorithm 5: DPPathSampler(G, k)
Input: A colored graph G = (V, E), and an integer k
Output: A uniformly sampled k-color path
1 $\vec{G}$ ← the DAG generated by the color ordering of G;
2 H ← DPPathCount($\vec{G}$, k);
3 R ← DPPathSampling($\vec{G}$, H, k);
4 return R;
5 Procedure DPPathCount($\vec{G}$, k)
6     H(v_i, j) ← 0, for i ∈ [1, n] and j ∈ [1, k − 1];
7     foreach i = 1 to n do H(v_i, 0) = 1;
8     foreach j = 1 to k − 1 do
9         for i = 1 to n do
10            for v_x ∈ N_{v_i}($\vec{G}$) do
11                H(v_i, j) ← H(v_i, j) + H(v_x, j − 1);
12    return H;
13 Procedure DPPathSampling($\vec{G}$, H, k)
14    R ← ∅; Q ← V;
15    for i = 0 to k − 1 do
16        cnt ← Σ_{u∈Q} H(u, k − i − 1);
17        Set the probability distribution D over the nodes in Q where p(u) = H(u, k − i − 1)/cnt for each u ∈ Q;
18        Sample a node u from D;
19        R ← R ∪ {u}; Q ← N_u($\vec{G}$);
20    return R;
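The DAG construction and the two procedures of Algorithm 5 can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the names `color_dag`, `dp_path_count`, and `dp_path_sample`, the dictionary-based graph representation, and the decision to return `None` when the DAG contains no k-color path are assumptions of this sketch.

```python
import random

def color_dag(adj, color):
    """Orient every edge from the endpoint that comes first in the
    (color, id) order to the one that comes later."""
    return {u: [v for v in adj[u] if (color[v], v) > (color[u], u)]
            for u in adj}

def dp_path_count(out, k):
    """H[j][v] = number of paths with j edges that start at v in the DAG,
    following the recurrence H(v, j) = sum of H(x, j - 1) over out-neighbors x."""
    H = [{v: 1 for v in out}]
    for j in range(1, k):
        H.append({v: sum(H[j - 1][x] for x in out[v]) for v in out})
    return H

def dp_path_sample(out, H, k, rng=random):
    """Sample one k-color path uniformly, or return None if there is none."""
    path, candidates = [], list(out)
    for i in range(k):
        # Weight each candidate by the number of ways to finish the path from it.
        weights = [H[k - i - 1][u] for u in candidates]
        if sum(weights) == 0:
            return None
        u = rng.choices(candidates, weights=weights)[0]
        path.append(u)
        candidates = out[u]
    return path
```

For instance, on a triangle {0, 1, 2} with a pendant node 3 attached to node 2 and colors 1, 2, 3, 1 for nodes 0, 1, 2, 3, the only 3-color path in the DAG is 0 → 1 → 2, so `sum(H[2].values())` equals 1 and the sampler always returns that path.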
Note that a k-color path in $\vec{G}$ does not necessarily form a k-clique in G. However, the set of k-color paths is clearly a subset of the set of k-color sets. Thus, the k-clique density over the k-color paths, denoted by $\rho_p$, must be no smaller than the k-clique density over the k-color sets.

Example 2. Reconsider the graph shown in Fig. 1(a). Clearly, we have c(0) < c(1) = c(2) < c(3) = c(5) < c(4) = c(6). Fig. 1(c) plots all the 3-color paths, and Fig. 1(d) shows all the 4-color paths. The paths with dashed circles are not cliques, while the others are cliques. We can also easily derive that the 3-clique density is 6/10 and the 4-clique density is 1/3. As expected, the count of k-color paths is much smaller than the count of k-color sets.

To estimate the number of k-cliques in G, we need to compute $\rho_p$ and the number of k-color paths as well. Let $\vec{G}_{v_i}$ be a subgraph of $\vec{G}$ induced by $\{v_i, ..., v_n\}$. Denote by $H(v_i, j)$ the number of j-paths containing the node $v_i$ in $\vec{G}_{v_i}$. Clearly, each j-path containing $v_i$ in $\vec{G}_{v_i}$ must start from $v_i$, since the node $v_i$ in $\vec{G}_{v_i}$ only has out-neighbors. Thus, the total number of (k − 1)-paths of $\vec{G}$, denoted by $cnt_{k-1}(\vec{G}, path)$, can be computed by the following formula:

\[ cnt_{k-1}(\vec{G}, path) = \sum_{v_i \in \vec{G}} H(v_i, k-1). \tag{12} \]

Observe that the second node in each (k − 1)-path containing $v_i$ in $\vec{G}_{v_i}$ must be an out-neighbor of $v_i$. Thus, if we have the count of the (j − 2)-paths containing $v_x$ in $\vec{G}_{v_x}$ for each $v_x \in N_{v_i}(\vec{G}_{v_i})$, the count of (j − 1)-paths containing $v_i$ in $\vec{G}_{v_i}$ can be easily obtained. Specifically, we have the following recursive equation:

\[ H(v_i, j) = \sum_{v_x \in N_{v_i}(\vec{G}_{v_i})} H(v_x, j-1). \tag{13} \]

Initially, we have

\[ \begin{cases} H(v_i, 0) = 1, & \text{for all } i \in [1, n], \\ H(v_i, j) = 0, & \text{for all } i \in [1, n],\ j \in [1, k-1]. \end{cases} \tag{14} \]

Based on Eqs. (12), (13) and (14), we can easily devise a DP algorithm to compute $cnt_{k-1}(\vec{G}, path)$, which is detailed in the DPPathCount procedure of Algorithm 5 (lines 5-12). It is easy to derive that the time complexity of DPPathCount is O(knχ), where χ is the maximum color value of G. This is because the cardinality of the out-neighbors for any node in $\vec{G}$ is bounded by O(χ).

Sampling a uniform k-color path. Similar to the DP-based sampling technique developed in Section 4.1, here we also propose a DP-based sampling algorithm to uniformly sample the k-color paths. Suppose without loss of generality that there is a randomly sampled k-color path of $\vec{G}$ starting from a node v, denoted by $P_v$. Then, the second node in $P_v$ must be an out-neighbor of v in $\vec{G}$. According to the DP equation (Eq. (13)), the number of (k − 1)-paths starting from v is equal to the sum of the number of (k − 2)-paths starting from each node in $N_v(\vec{G})$. Therefore, the next node of a random k-color path starting from v, denoted by u, should be drawn from $N_v(\vec{G})$ with probability $\frac{H(u,k-2)}{H(v,k-1)}$ by Eq. (13). We can recursively perform this sampling procedure to


obtain a k -color path. The detailed implementation of this DAG and the DP table H which uses O(nk + m) space in
sampling technique is shown in Algorithm 5. total.
Algorithm 5 first constructs a DAG G ~ by the color
ordering (line 1). Then, the algorithm invokes DPPathCount 5.2 Estimating the k -clique counts
to derive the DP table H (line 2). After that, Algorithm 5 Based on Algorithm 5, we can devise a weighted sampling
calls the DPPathSampling procedure to uniformly sample algorithm to construct an unbiased estimator to compute
a k -color path (line 3). Specifically, when sampling a node the number of k -cliques. Specifically, we can slightly modify
u from Nv (G)~ , DPPathSampling needs to set a probabil-
Algorithm 4 by (1) replacing the DPCount procedure in
ity distribution D over the set Nv (G) ~ based on Eq. (13) line 4 of Algorithm 4 with the DPPathCount procedure, and
(lines 16-18). After choosing a node u, DPPathSampling (2) replacing DPSampling in line 10 of Algorithm 4 with
turns to sample the next node from Nu (G) ~ (line 19). The DPPathSampling. Due to the space limit, we omit the details
DPPathSampling procedure terminate when k nodes are of this modified algorithm. Similar to Theorems 4 and 5,
sampled. the estimator based on the k -color path sampling is also
It is important to note that Algorithm 5 can always unbiased, and the sample size can also be bounded by using
obtain a k -color path if the DAG G ~ contains at least one the Chernoff bound. Moreover, it is easy to check that the
k -color path. This is because in lines 16-18, if a node u sample size is no larger than that of Algorithm 4, because
is sampled, then H(u, k − i − 1) must be larger than 0, ρp ≥ ρc .
indicating that the out-neighborhood Nu (G) ~ must be non- For the time complexity, such a modified algorithm takes
empty. As a consequence, if there is a k -color path in G ~, O(|S|δ 2 k) to compute the DP tables (i.e., H ) for all nodes in
the for loop in line 15 of Algorithm 5 will be executed k S (because the input graph G(Nv (G)) ~ for the DPPathCount
times which results in a k -color path. The following theorem procedure has at most δ nodes), and consumes O(δk + k 2 )
shows that Algorithm 5 can obtain a uniform k -color path. to sample a k -color path. Thus, the total time complexity
of the algorithm is O(|S|δ 2 k + (δk + k 2 )t + m + n), where
Theorem 9. Algorithm 5 outputs a uniform k -color path. O(m + n) is taken for computing the graph coloring. The
space overhead of the modified algorithm is O(m + n + δk),
Proof. Consider a path {v1 , v2 , · · · , vk }. Let X be the event
because the DP table takes O(δk) space.
of this path being sampled by Algorithm 5. Denote by
Yi the event of a node vi appearing in the path. Clear-
ly, the probability of the first node v1 being sampled is 6 COLORFUL TRIANGLE-PATH SAMPLING
Pr(Y1 ) = P H(vH(u,k−1)
1 ,k−1)
. Observe that in the ith -iteration of Note that k -color paths can significantly remove the un-
u∈V
the for loop (line 15), the distribution D for node vi is con- promising k -color sets by introducing a connective con-
structed from $N_{v_{i-1}}(\vec{G})$. The node $v_i$ being sampled in the for loop can be represented as an event $Y_i \mid Y_{i-1}$ (conditioned on $Y_{i-1}$), thus we have

$$\Pr(Y_i \mid Y_{i-1}) = \frac{H(v_i, k-i)}{\sum_{u \in N_{v_{i-1}}(\vec{G})} H(u, k-i)}.$$

As a consequence, we have

$$\Pr(X) = \Pr(Y_1) \times \Pr(Y_2 \mid Y_1) \times \cdots \times \Pr(Y_k \mid Y_{k-1}) = \frac{H(v_1, k-1)}{\sum_{u \in V} H(u, k-1)} \times \frac{H(v_2, k-2)}{\sum_{u \in N_{v_1}(\vec{G})} H(u, k-2)} \times \cdots \times \frac{H(v_k, 0)}{\sum_{u \in N_{v_{k-1}}(\vec{G})} H(u, 0)} = \frac{1}{\sum_{u \in V} H(u, k-1)}. \tag{15}$$

Since the number of k-color paths in G is equal to $\sum_{u \in V} H(u, k-1)$, each k-color path is sampled uniformly.

We analyze the time and space complexity of Algorithm 5 in the following theorem.

Theorem 10. Given an input graph G with n nodes and m edges, Algorithm 5 takes $O(\chi n k + m)$ time and uses $O(kn + m)$ space, where $\chi$ is the maximum color value.

Proof. First, the algorithm consumes $O(m + n)$ time to obtain a DAG. Second, as analyzed above, the DPPathCount procedure takes $O(nk\chi)$ time. Third, the DPPathSampling procedure uses $O(n + \chi k)$ time. This is because setting the probability distribution for the first node takes $O(n)$ time, while for the other nodes it takes at most $O(\chi)$ time. Thus, the total time complexity of Algorithm 5 is $O(\chi n k + m)$. For the space complexity, the algorithm needs to store the

straint (i.e., a k-color set must form a path). However, the k-color path is still a very sparse structure, which does not fully capture the clique property. Specifically, a k-color path only guarantees the existence of k − 1 edges, which is the smallest number of edges needed to maintain connectivity. In this section, we further develop a new technique, called k-triangle path, to prune those unpromising k-color paths that are not k-cliques. In a simple path, any two consecutive nodes form a 2-clique. Similarly, we define the concept of triangle-path. In a triangle-path, any three consecutive vertices form a triangle. When the nodes of a triangle-path have distinct colors, the triangle-path is called a colorful triangle-path. In the following, we use k-triangle path to refer to a colorful triangle-path with k nodes.

The k-triangle path can capture the clique property better than the k-color path, which can further improve the clique density. However, compared to the k-color path, the k-triangle path is a more complex structure. It is nontrivial to design efficient algorithms for uniformly sampling k-triangle paths. Below, we propose a new DP algorithm to achieve this goal.

6.1 DP-based k-triangle path Sampling

As described in Section 5.1, we assume that the graph G is colored with the color values selected from $[1, \chi]$. We can construct a DAG $\vec{G}$ based on the color ordering. Below, we formally define the concept of k-triangle path.

Definition 2. A k-triangle path is a k-color set with vertices $\{v_1, v_2, v_3, \ldots, v_k\}$ where $c(v_i) < c(v_{i+1})$ for all $i \in [1, k-1]$ and $v_i, v_{i+1}, v_{i+2}$ form a triangle for all $i \in [1, k-2]$.


Let $\vec{G}$ be the DAG generated by the color ordering. Counting the colorful triangle-paths in G is equivalent to counting the triangle-paths in $\vec{G}$. Moreover, any k-clique $C = \{v_1, v_2, \cdots, v_k\}$ in G is a k-triangle path in $\vec{G}$. As described in Section 5.1, the set of k-color paths is a subset of the set of k-color sets. Similarly, the set of k-triangle paths is a subset of the set of k-color paths. Thus, the k-clique density over the k-triangle paths, denoted by $\rho_t$, must be no less than the k-clique density over the k-color paths.

Example 3. In Fig. 1(d), the two 4-color paths in the box are k-triangle paths. For example, in the 4-color path {0, 1, 5, 6}, both {0, 1, 5} and {1, 5, 6} are triangles. However, in the 4-color path {0, 2, 3, 6}, the three consecutive nodes {2, 3, 6} do not form a triangle, thus {0, 2, 3, 6} is not a k-triangle path.

Counting the number of k-triangle paths. Denote by $T((v_x, v_y), j)$ the number of j-triangle-paths with $(v_x, v_y)$ as the first edge. Then, the total number of k-triangle paths of $\vec{G}$, denoted by $cnt_k(G, triangle)$, can be computed by the following formula:

$$cnt_k(G, triangle) = \sum_{(v_i, v_j) \in \vec{G}} T((v_i, v_j), k). \tag{16}$$

Let $\vec{G}_{v_i}$ be the subgraph of $\vec{G}$ induced by $\{v_i, \cdots, v_n\}$. Since the node $v_x$ in $\vec{G}_{v_x}$ only has out-neighbors, the other nodes in the k-triangle paths are in $\vec{G}_{v_x}$. Denote by $v_z$ the third node in a j-triangle-path. It is easy to see that $v_z$ is a common neighbor of $v_x$ and $v_y$, because any three consecutive nodes in a k-triangle path must form a triangle. Based on this property, we can derive the following equation:

$$T((v_x, v_y), j) = \sum_{v_z \in N_{v_x}(\vec{G}_{v_x}) \cap N_{v_y}(\vec{G}_{v_y})} T((v_y, v_z), j-1). \tag{17}$$

Algorithm 6: DPTriSampler (G, k)
Input: A colored graph G = (V, E), and an integer k
Output: A uniformly sampled k-triangle path
 1  $\vec{G}$ ← the DAG generated by the color ordering of G;
 2  T ← TriPathCount($\vec{G}$, k);
 3  R ← DPTriSampling($\vec{G}$, T, k);
 4  return R;
 5  Procedure TriPathCount($\vec{G}$, k)
 6      $T((v_x, v_y), 2)$ ← 1, for $(v_x, v_y) \in E$;
 7      foreach j = 3 to k do
 8          for $(v_x, v_y) \in E$ do
 9              for $v_z \in N_{v_x}(\vec{G}_{v_x}) \cap N_{v_y}(\vec{G}_{v_y})$ do
10                  $T((v_x, v_y), j)$ ← $T((v_x, v_y), j) + T((v_y, v_z), j-1)$;
11      return T;
12  Procedure DPTriSampling($\vec{G}$, T, k)
13      Set the probability distribution D over the edges in E where $p(e) = T(e, k) / \sum_{e \in E} T(e, k)$;
14      Sample an edge $(v_x, v_y)$ from D;
15      R ← $\{v_x, v_y\}$; Q ← $N_{v_x}(\vec{G}_{v_x}) \cap N_{v_y}(\vec{G}_{v_y})$;
16      for i = 3 to k do
17          cnt ← $\sum_{v_z \in Q} T((v_y, v_z), k-i+2)$;
18          Set the probability distribution D over the nodes in Q where $p(v_z) = T((v_y, v_z), k-i+2) / cnt$ for each $v_z \in Q$;
19          Sample a node $v_z$ from D;
20          R ← R ∪ $\{v_z\}$;
21          Q ← $N_{v_y}(\vec{G}_{v_y}) \cap N_{v_z}(\vec{G}_{v_z})$;
22          $v_y$ ← $v_z$;
23      return R;
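The counting recurrence of Eq. (17) and the sampling loop of Algorithm 6 can be sketched in Python as follows. This is an illustrative sketch only, not the authors' C++ implementation: the helper names `build_dag`, `tri_path_count` and `dp_tri_sampling`, the dictionary-based DP table, and the toy graph (a 4-clique {0, 1, 2, 3} plus a vertex 4 adjacent to 2 and 3, colored by vertex id) are assumptions. The base case follows line 6 of Algorithm 6, where every edge is counted as one 2-triangle path.

```python
import random
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[int, int]


def build_dag(adj: Dict[int, Set[int]], color: Dict[int, int]) -> Dict[int, Set[int]]:
    """Orient every edge from the lower-colored to the higher-colored endpoint."""
    return {u: {v for v in adj[u] if color[u] < color[v]} for u in adj}


def tri_path_count(out: Dict[int, Set[int]], k: int) -> Dict[int, Dict[Edge, int]]:
    """T[j][(x, y)] = number of j-triangle paths whose first edge is (x, y)."""
    edges = [(x, y) for x in out for y in out[x]]
    T: Dict[int, Dict[Edge, int]] = {2: {e: 1 for e in edges}}  # line 6
    for j in range(3, k + 1):                                    # lines 7-10
        T[j] = {(x, y): sum(T[j - 1][(y, z)] for z in out[x] & out[y])
                for (x, y) in edges}
    return T


def dp_tri_sampling(out: Dict[int, Set[int]], T: Dict[int, Dict[Edge, int]],
                    k: int, rng: random.Random) -> Optional[List[int]]:
    """Draw one k-triangle path; returns None if the DAG contains none."""
    edges = [e for e, w in T[k].items() if w > 0]
    if not edges:
        return None
    # Lines 13-15: pick the first edge with probability proportional to T(e, k).
    x, y = rng.choices(edges, weights=[T[k][e] for e in edges])[0]
    path = [x, y]
    for i in range(3, k + 1):                                    # lines 16-22
        level = T[k - i + 2]
        cand = [z for z in sorted(out[x] & out[y]) if level[(y, z)] > 0]
        z = rng.choices(cand, weights=[level[(y, c)] for c in cand])[0]
        path.append(z)
        x, y = y, z
    return path


# Hypothetical toy graph: 4-clique {0,1,2,3} plus vertex 4 adjacent to 2 and 3.
ADJ = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3, 4}, 3: {0, 1, 2, 4}, 4: {2, 3}}
COLOR = {v: v for v in ADJ}

if __name__ == "__main__":
    dag = build_dag(ADJ, COLOR)
    table = tri_path_count(dag, 4)
    print("number of 4-triangle paths:", sum(table[4].values()))
    print("one sample:", dp_tri_sampling(dag, table, 4, random.Random(7)))
```

The sketch stores one dictionary per path length for readability. Candidates with a zero count are filtered out before sampling, so the loop can always extend the path once a first edge with a positive count has been chosen.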

Initially, we have

$$T((v_x, v_y), 2) = 1, \quad \forall (v_x, v_y) \in E. \tag{18}$$

Based on these equations, we can easily devise a DP algorithm to compute $cnt_k(G, triangle)$. The detailed implementation of this DP algorithm is shown in the TriPathCount procedure of Algorithm 6 (lines 5-11). Specifically, line 6 initializes the DP table based on Eq. (18), and lines 7-10 perform the DP procedure based on Eq. (17).

Sampling a uniform k-triangle path. Similar to the algorithm for uniformly sampling the k-color paths, we propose a DP-based sampling algorithm to uniformly draw k-triangle paths. Suppose that there is a randomly selected k-triangle path, denoted by P. With Eq. (16), we can derive that the probability of P starting with edge $(v_x, v_y)$ is $\frac{T((v_x, v_y), k)}{cnt_k(G, triangle)}$. The third node $v_z$ in P must then be a common out-neighbor of $v_x$ and $v_y$. According to Eq. (17), the number of k-triangle paths with $(v_x, v_y)$ as the first edge is equal to the sum of the numbers of (k − 1)-triangle paths with $(v_y, v_z)$ as the first edge, thus the probability of $v_z$ being sampled is $\frac{T((v_y, v_z), k-1)}{T((v_x, v_y), k)}$. A similar mechanism can be applied to sample the subsequent nodes. The detailed implementation is shown in Algorithm 6.

Algorithm 6 first constructs a DAG $\vec{G}$ by the color ordering (line 1). Then, the algorithm invokes TriPathCount to derive the DP table T (line 2) and calls the DPTriSampling procedure to uniformly sample a k-triangle path (line 3). Based on Eq. (17), DPTriSampling sets a probability distribution D over the set of edges (line 13) and samples the first two nodes $v_x$ and $v_y$ according to D (lines 14-15). With the first two nodes, the set of candidate third nodes is the set of common out-neighbors of $v_x$ and $v_y$ (line 15). Then, DPTriSampling samples the next node from Q by setting a probability distribution over Q (lines 17-20). The DPTriSampling procedure terminates when k nodes are sampled.

A k-triangle path can always be obtained by Algorithm 6 if the DAG $\vec{G}$ contains at least one k-triangle path. This is because in lines 17-20, if a node $v_z$ is sampled, then $T((v_y, v_z), k-i+2)$ must be larger than 0, indicating that the set of common out-neighbors of $v_y$ and $v_z$ must be non-empty (line 21). As a consequence, the for loop in line 16 of Algorithm 6 will be executed k − 2 times, which results in a k-triangle path. The following theorem shows that Algorithm 6 obtains a uniform k-triangle path.

Theorem 11. Algorithm 6 outputs a uniform k-triangle path.

Proof. Consider a path $\{v_1, v_2, \cdots, v_k\}$. Let X be the event of this path being sampled by Algorithm 6. Denote by $Y_i$ the event of the two nodes $v_{i-1}$ and $v_i$ appearing in the path.


Clearly, the probability of the first two nodes $v_1$ and $v_2$ being sampled is $\Pr(Y_2) = \frac{T((v_1, v_2), k)}{\sum_{e \in E} T(e, k)}$. Observe that in the i-th iteration of the for loop (line 16), the nodes in Q are the common out-neighbors of $v_{i-1}$ and $v_{i-2}$. Thus, the event that a node $v_i$ is sampled in the i-th iteration can be represented as $Y_i \mid Y_{i-1}$ (conditioned on $Y_{i-1}$). We have

$$\Pr(Y_i \mid Y_{i-1}) = \frac{T((v_{i-1}, v_i), k-i+2)}{\sum_{u \in N_{v_{i-1}}(\vec{G}_{v_{i-1}}) \cap N_{v_{i-2}}(\vec{G}_{v_{i-2}})} T((v_{i-1}, u), k-i+2)} = \frac{T((v_{i-1}, v_i), k-i+2)}{T((v_{i-2}, v_{i-1}), k-i+3)}.$$

As a result, we have

$$\Pr(X) = \Pr(Y_2) \times \Pr(Y_3 \mid Y_2) \times \cdots \times \Pr(Y_k \mid Y_{k-1}) = \frac{T((v_1, v_2), k)}{\sum_{e \in E} T(e, k)} \times \frac{T((v_2, v_3), k-1)}{T((v_1, v_2), k)} \times \cdots \times \frac{T((v_{k-1}, v_k), 2)}{T((v_{k-2}, v_{k-1}), 3)} = \frac{1}{\sum_{e \in E} T(e, k)}. \tag{19}$$

Since the number of k-triangle paths in G is equal to $\sum_{e \in E} T(e, k)$, each k-triangle path is sampled uniformly.

Example 4. Let k = 4. In Fig. 1(a), there are two k-triangle paths {0, 1, 5, 6} and {0, 2, 3, 4}. We have $T((0, 1), 4) = 1$, which means the count of k-triangle paths with head (0, 1) is 1. So the probability of (0, 1) being sampled as the first two vertices is $\frac{T((0,1),4)}{2} = \frac{1}{2}$ (line 13 of Algorithm 6). The set of common neighbors of 0 and 1 is {5}. Thus the probability of 5 being sampled as the third vertex is 1. Similarly, the set of common neighbors of 1 and 5 is {6}, and the probability of 6 being sampled is 1. At last, the probability of {0, 1, 5, 6} being sampled is $\frac{1}{2} \times 1 \times 1 = \frac{1}{2}$. The probability of {0, 2, 3, 4} is also $\frac{1}{2}$. Therefore, the probability of each k-triangle path being sampled is equal.

Remark. Note that for all our algorithms, the clique density is a fixed value of a network. For example, for the k-triangle path sampling algorithm, the clique density is determined by the count of the cliques among the k-triangle paths. Reconsider the graph in Fig. 1(a): there is one 4-clique {0, 2, 3, 4} among the two triangle paths {0, 2, 3, 4} and {0, 1, 5, 6}, so the clique density is 0.5.

We analyze the time and space complexity of Algorithm 6 in the following theorem.

Theorem 12. The procedure TriPathCount in Algorithm 6 takes $O(k\Delta)$ time and uses $O(km)$ space, where $\Delta$ is the number of triangles of the input graph. The procedure DPTriSampling in Algorithm 6 takes $O(m + \chi k)$ time.

Proof. It is easy to see that the total time cost of lines 8 and 9 in Algorithm 6 is bounded by $O(\Delta)$. Thus, the TriPathCount procedure takes at most $O(k\Delta)$ time. For the space complexity, the algorithm needs to store the DP table T which uses $O(mk)$ space. In DPTriSampling, setting the probability distribution for the first two nodes takes $O(m)$ time (line 13), while for the other nodes it takes at most $O(\chi)$ time. Thus, the total time complexity of DPTriSampling is $O(m + \chi k)$.

6.2 Estimating the k-clique counts

Similar to Section 4.2 and Section 5.2, we can construct an unbiased k-clique estimator based on Algorithm 4 and Algorithm 6. Since the estimator is very similar to those shown in Section 4.2 and Section 5.2, we omit the details for brevity.

For the time complexity, such a modified algorithm takes $O(\Delta_S k)$ to compute the DP tables (i.e., the DP table T in Algorithm 6) for all nodes in S, where $\Delta_S$ is the total number of triangles in the dense region S. It also consumes $O(\chi k + k^2)$ to sample one k-triangle path, where $O(k^2)$ is the time to check whether the sampled k nodes form a k-clique. Thus, the total time complexity of the algorithm is $O(\Delta_S k + (\chi k + k^2)t + m + n)$, where $O(m + n)$ is taken for computing the graph coloring. The space overhead of the modified algorithm is $O(\delta^2 k)$, because the DP table takes $O(m' k)$ space where $m'$ is the maximum number of edges for $\vec{G}_v$ and it must satisfy $m' \le \delta^2$.

6.3 Discussion

In this subsection, we analyze the relationships among the three proposed algorithms and analyze in which cases k-triangle paths are better than k-color sets and k-color paths.

According to Theorem 5, the sample size needed to compute an accurate estimate is $\frac{3}{\rho \epsilon^2} \ln \frac{1}{\sigma}$, where $\rho$ is the clique density and $\epsilon$, $\sigma$ are small constants. This bound is determined only by the clique density. In other words, if the sample size is fixed, the clique density is the only factor that affects the accuracy. When $\rho$ is very small, quite a large number of samples is needed to achieve an accurate answer. All Bernoulli-style sampling algorithms have this property [34].

Denote by $\rho_c$, $\rho_p$, $\rho_t$ the clique density over the k-color sets, k-color paths, and k-triangle paths, respectively. $\rho_t$ is always the largest one, as described in the following. Since a k-triangle path must be a k-color path and a k-color path must be a k-color set, the set of k-triangle paths is a subset of the k-color paths and the set of k-color paths is a subset of the k-color sets. Thus $\rho_t \ge \rho_p \ge \rho_c$. For example, in Fig. 1(a), when k = 4, we have $\rho_c = \frac{1}{8}$, $\rho_p = \frac{1}{3}$, $\rho_t = \frac{1}{2}$ (as shown in Fig. 1(b) and Fig. 1(d)). Table 2 in the experiments further shows this relationship. We also plot the trends of $\rho_c$, $\rho_p$ and $\rho_t$ on two representative datasets, Stanford and Orkut, in Fig. 2. In Fig. 2, $\rho_t$ is the largest and the most robust as k becomes large. Thus the estimator based on the k-triangle paths needs a smaller sample size.

However, a smaller sample size does not mean less running time. According to Theorem 6, the proposed sampling-based algorithms are composed of two steps. The first step is the computation of the dynamic programming table. The second step is drawing t samples according to the distribution defined by the dynamic programming table. Since the k-triangle path is a more complex structure than those used by DPColor and DPColorPath, the computation of its dynamic programming table needs more running time, as described in Theorems 3, 10 and 12. Thus the k-triangle path is better than the k-color set and the k-color path when $\rho_p$ and $\rho_c$ are much smaller than $\rho_t$. As shown in Fig. 2, this happens when k becomes large on real-world networks.
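To make the dependence of the sample size on the clique density concrete, the bound quoted from Theorem 5 can be evaluated for the three densities of the Fig. 1(a) example. This is a minimal sketch under stated assumptions: the bound is taken to read $\frac{3}{\rho \epsilon^2} \ln \frac{1}{\sigma}$, the function name `required_samples` is hypothetical, and the values of $\epsilon$ and $\sigma$ are chosen only for illustration.

```python
import math


def required_samples(rho: float, eps: float, sigma: float) -> int:
    """Sample size suggested by the bound 3 / (rho * eps^2) * ln(1 / sigma)."""
    if not (0.0 < rho <= 1.0 and eps > 0.0 and 0.0 < sigma < 1.0):
        raise ValueError("expected 0 < rho <= 1, eps > 0 and 0 < sigma < 1")
    return math.ceil(3.0 / (rho * eps ** 2) * math.log(1.0 / sigma))


if __name__ == "__main__":
    eps, sigma = 0.05, 0.01  # illustrative values, not prescribed by the text
    # Densities of the Fig. 1(a) example for k = 4.
    for name, rho in [("k-color sets", 1 / 8),
                      ("k-color paths", 1 / 3),
                      ("k-triangle paths", 1 / 2)]:
        print(f"{name:17s} rho = {rho:.3f}  samples = {required_samples(rho, eps, sigma)}")
```

Because the bound is inversely proportional to $\rho$, moving from k-color sets to k-triangle paths in this example reduces the required number of samples by a factor of four, while the cost of building the DP table grows as discussed above.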



[Figure 2: density versus k for DPColor, DPColorPath and DPTriPath; panels (a) Stanford and (b) Orkut]
Fig. 2. Comparing $\rho_c$, $\rho_p$, and $\rho_t$ for different k.

TABLE 1
Datasets

Networks      n            m               δ
Themaker      69,413       3,289,686       164
Stanford      281,903      1,992,636       71
DBLP          425,957      1,049,866       113
Google        916,428      4,322,051       44
Skitter       1,696,415    11,095,298      111
Orkut         3,072,627    117,185,083     253
LiveJournal   4,036,538    34,681,189      360
Friendster    65,608,366   1,806,067,135   304

Algorithm 7: The Adaptive Sampling Framework
Input: A graph G = (V, E), the dense part S of G, an integer k, and the error bound $\epsilon$
Output: A $(1 - \epsilon)$-approximation of the density of k-cliques.
 1  C ← 0; T ← 0;
 2  t ← 10000;
 3  threshold ← $\frac{3}{\epsilon^2} \ln \frac{1}{\sigma}$;
 4  while C < threshold do
 5      Sampling($\vec{G}$, S, k, t);
 6      T ← T + t;
 7      c ← count of sampled cliques among the t samples;
 8      C ← C + c;
 9      if C > 0 then t ← threshold/C × T;
10      else t ← t × 10;
11  return C/T;

[Figure 4: relative error (%) versus number of samples for PEANUTS, DPColor, DPColorPath and DPTriPath; panels (a) Stanford and (b) LiveJournal]
Fig. 4. Relative errors with varying sample size (k = 8)

7 ADAPTIVELY DETERMINING THE SAMPLE SIZE

In Algorithm 2, the sample size t must be set to a fixed value. The advantage of a fixed sample size is that the running time can be controlled by the parameter t. However, there is no guarantee that the results given by Algorithm 2 are accurate. To overcome this problem, we provide a new framework that can guarantee the accuracy.

The key idea of the new framework is that an estimate is accurate if the number of cliques in the samples exceeds a threshold. We set the threshold as $\frac{3}{\epsilon^2} \ln \frac{1}{\sigma}$ according to Theorem 13, which explains the idea more clearly.

Theorem 13. Suppose that the sample size is t and the number of k-cliques among the t samples is c. Then $\hat{\rho} = \frac{c}{t}$ is a $(1 - \epsilon)$-approximation of $\rho$ with probability $1 - 2\sigma$ if $c \ge \frac{3}{\epsilon^2} \ln \frac{1}{\sigma}$.

Proof. Since $c = \hat{\rho} t$, it has $t \ge \frac{3}{\hat{\rho} \epsilon^2} \ln \frac{1}{\sigma}$. Then the theorem can be proved by Theorem 5.

Theorem 13 states that $\hat{\rho}$ is accurate as long as c is large enough, regardless of the value of t. Based on this idea, we design a new framework that keeps sampling until c is larger than the threshold. In the new framework, we utilize adaptive sampling to adapt the sample size according to the existing sampling results. If there are C cliques in T samples already and we need threshold cliques in total, the next sample size should be threshold/C × T.

The details of the new framework are shown in Algorithm 7. Algorithm 7 takes an error bound $\epsilon$ as input and returns a $(1 - \epsilon)$-approximation. At first, it draws $10^4$ samples to test the clique density (line 2). If no clique is sampled, it uses more samples to test the clique density (line 10). If there exist cliques in the T samples, it adjusts the number of samples accordingly (line 9). The adjusting method in line 9 is a simple yet effective way to make the value of C approach the threshold. At last, the approximation is returned (line 11).

Instead of the time complexity, we analyze the upper bound of the number of samples drawn by Algorithm 7, i.e., the value of T in Algorithm 7, which is the key to the running time. We omit the proof because it is quite clear.

Theorem 14. The number of samples drawn by Algorithm 7 is $O(\max(10^4, \frac{3}{\rho \epsilon^2} \ln \frac{1}{\sigma}))$.

The advantage of the new framework is that it can guarantee the accuracy of the results. The disadvantage is that the time complexity of our algorithm depends on the clique density. Therefore, when k is large (e.g., k > 25), the clique density might be extremely small, so that the algorithm requires a large number of samples to achieve a good accuracy guarantee. In this case, the algorithm may be costly to obtain a good approximation. Fortunately, for real-world applications, k is often not very large (e.g., k < 20), and our algorithm is very efficient in practice, as shown in our experiments. In fact, in the subgraph counting field, there are no existing algorithms that have both polynomial time complexity and a strong accuracy guarantee [35].

Example 5. To aid understanding, we describe how the adaptive sampling framework works on the Orkut network with $\epsilon = 0.05$, $\sigma = 0.01$ and DPPathSampler. The threshold in line 3 is 5519. The real clique density is 0.0132. At first, the framework samples $10^4$ times and gets 91 cliques. Now the estimated density is 0.0091 and the error is $\frac{0.0132 - 0.0091}{0.0132} = 0.31$, which is larger than $\epsilon$. According to the adaptive sampling method, to make the count of the sampled cliques larger than threshold, we need threshold/C × T = 606483 more samples (line 9). After sampling, there are 7837 cliques in the 606483 samples. Now there are C = 7837 + 91 cliques among the T = 606483 + 10000 samples, and the estimated clique density is 0.0129. The error is $\frac{0.0132 - 0.0129}{0.0132} = 0.02$, which is smaller than $\epsilon$.
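The control flow of Algorithm 7 can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the k-triangle path sampler and the clique check are replaced by a hypothetical callback `draw_is_clique` that returns True when one drawn sample is a k-clique, the threshold is assumed to read $\frac{3}{\epsilon^2} \ln \frac{1}{\sigma}$, and the cap `max_total` is an extra safety guard that is not part of Algorithm 7.

```python
import math
import random
from typing import Callable, Tuple


def adaptive_density(draw_is_clique: Callable[[], bool], eps: float, sigma: float,
                     t0: int = 10_000, max_total: int = 10 ** 8) -> Tuple[float, int]:
    """Keep sampling until enough cliques are seen; return (estimate, samples used)."""
    threshold = 3.0 / eps ** 2 * math.log(1.0 / sigma)   # line 3
    C, T, t = 0, 0, t0                                    # lines 1-2
    while C < threshold:                                  # line 4
        if T >= max_total:                                # assumption: safety cap
            break
        c = sum(1 for _ in range(t) if draw_is_clique())  # lines 5 and 7
        T += t                                            # line 6
        C += c                                            # line 8
        if C > 0:
            t = math.ceil(threshold / C * T)              # line 9
        else:
            t *= 10                                       # line 10
    return C / T, T                                       # line 11


if __name__ == "__main__":
    rng = random.Random(42)
    true_density = 0.01  # stand-in for the unknown clique density
    estimate, used = adaptive_density(lambda: rng.random() < true_density,
                                      eps=0.1, sigma=0.05)
    print(f"estimate = {estimate:.5f} after {used} samples")
```

In the usage example each sample is a Bernoulli trial with a fixed success probability, which mimics drawing a path and testing whether it is a clique. With a real sampler, the callback would draw one sample and run the $O(k^2)$ clique check.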


[Figure 3: running time (s) versus k for kClist, PIVOTER, PEANUTS, DPColor, DPColorPath and DPTriPath; panels (a) Themaker, (b) webStanford, (c) DBLP, (d) webGoogle, (e) Skitter, (f) Orkut, (g) LiveJournal, (h) Friendster]
Fig. 3. Running time of different algorithms (the relative errors for PEANUTS, DPColor, DPColorPath and DPTriPath are set to 0.1%)

[Figure 5: results versus k for PEANUTS, DPColor, DPColorPath and DPTriPath; panels (a) the relative error (%), (b) the running time (s) needed to bring the error below 10%]
Fig. 5. The performance on Orkut when k is large

TABLE 2
The k-clique densities ($\rho_c$/$\rho_p$/$\rho_t$) in the dense regions (%)

Networks      k = 8                 k = 15
Themaker      0.001/0.003/0.02      0.0/0.0/1e-7
Stanford      70.8/77.7/97.3        45.8/53.1/91.7
DBLP          100.0/100.0/100.0     100.0/100.0/100.0
Google        91.2/94.4/96.0        84.7/85.7/85.9
Skitter       5.3/26.4/47.8         0.1/2.3/5.9
Orkut         0.0/2.6/22.8          0.0/0.0002/1.4
LiveJournal   80.4/91.0/95.0        -/-/-
Friendster    0.0/18.1/62.9         0.0/52.2/78.0

TABLE 3
Runtime of our parallel algorithms (k = 8, t = $5 \times 10^6$, sec.)

                             Threads
Datasets      Algorithms     1         4        8        12       16
LiveJournal   DPColor        24.8      7.1      4.3      2.7      2.1
LiveJournal   DPColorPath    28.4      7.5      3.9      2.7      2.1
LiveJournal   DPTriPath      142.67    37.98    18.94    12.80    9.81
Friendster    DPColor        2481.5    650.3    341.6    244.4    196.5
Friendster    DPColorPath    2132.2    559.6    293.3    210.3    171.9
Friendster    DPTriPath      2430.32   636.94   336.31   239.68   197.01

8 EXPERIMENTS

8.1 Experimental setup

We compare the proposed algorithms with three state-of-the-art k-clique counting algorithms: kClist [15], [16], PIVOTER [17], and TuranShadow [20]. The kClist algorithm is an exact k-clique counting algorithm based on k-clique enumeration [15]. Note that the original kClist algorithm is based on the degeneracy ordering. Li et al. [16] proposed an improved version based on a hybrid of the degeneracy and color ordering. In our experiments, kClist denotes this improved version. PIVOTER and TuranShadow are the state-of-the-art exact and approximate k-clique counting algorithms, respectively. Both PIVOTER and TuranShadow were proposed by Jain and Seshadhri [17], [20]. PEANUTS [27] is an improved version of TuranShadow which is more efficient than TuranShadow, thus we use PEANUTS as the baseline instead of TuranShadow. The C++ codes of all these algorithms are publicly available, thus we use their implementations in our experiments. For our algorithms, we implement DPColor, DPColorPath and DPTriPath. The three algorithms are Algorithm 2 integrated with the three sampling algorithms. All of them are implemented in C++. All algorithms are evaluated on a PC with two 2.1 GHz Xeon CPUs (16 cores in total) and 128GB memory running CentOS 7.6.

Datasets. We use 8 large real-life datasets in our experiments. Table 1 summarizes the detailed statistics of all datasets. The last column of Table 1 denotes the degeneracy of the graph. Stanford and Google are web networks. DBLP is a co-authorship network, and Skitter is an internet graph. Themaker, Orkut, LiveJournal, and Friendster are social networks. All datasets are downloaded from (snap.stanford.edu) and (https://networkrepository.com/networks.php).

8.2 Experimental results

Exp 1: Runtime of different algorithms. In this experiment, we compare the running time of different algorithms on all datasets. Note that for each approximation algorithm (PEANUTS, DPColor, DPColorPath and DPTriPath), we record its running time when the algorithm achieves a 0.1% relative error. Here the relative error is computed by $|f - \hat{f}|/f$, in which $f$ is the exact k-clique count and $\hat{f}$ is the estimated count. For all algorithms, if they cannot terminate within 5 hours, we set their running time to "INF". Fig. 3 shows the running time of various algorithms.


TABLE 4
The ratio of the k-cliques in the sparse regions.

Networks      k = 3     k = 8    k = 12   k = 15
Themaker      1.53%     1.24%    0.02%    0.30%
Stanford      6.15%     0.01%    0.00%    0.00%
DBLP          33.11%    0.00%    0.00%    0.00%
Google        11.63%    6.47%    0.69%    0.36%
Skitter       19.57%    0.03%    0.00%    0.00%
Orkut         6.47%     0.07%    0.00%    0.00%
LiveJournal   9.30%     0.00%    0.00%    0.00%
Friendster    26.15%    0.30%    0.01%    0.00%

We first compare our algorithms with kClist and PIVOTER. As can be seen, all of our algorithms DPColor, DPColorPath and DPTriPath are significantly faster than kClist and PIVOTER on most datasets with varying k. The kClist algorithm is generally intractable for large k on all datasets. On most datasets, DPColorPath is around one order of magnitude faster than PIVOTER. The hardest instance is the LiveJournal graph, on which PIVOTER only obtains the number of 4-cliques within 5 hours, whereas DPColorPath takes around 20 seconds to achieve a 0.1% relative error (DPColorPath is at least three orders of magnitude faster than PIVOTER on LiveJournal). Note that since both kClist and PIVOTER are intractable on LiveJournal when k ≥ 6, we use the exact k-clique count obtained from [36], where k ≤ 8, to compute the relative errors for the approximation algorithms. Moreover, as reported in [36], the running time of such a GPU-parallelized PIVOTER algorithm using 5120 CUDA cores is 6,851 seconds for k = 8, while our sequential DPColorPath (DPColor) takes around 20 seconds to obtain a very accurate k-clique count. On Orkut and Friendster, the exact algorithm PIVOTER is faster than DPColor. This is because the clique density over the k-color sets used by DPColor is almost zero on Orkut and Friendster, as shown in Table 2. According to Theorem 5, a large number of samples is needed to guarantee the accuracy when the clique density is small. This only happens on datasets that have a very small clique density. These experimental results indicate that our algorithms are extremely efficient for k-clique counting.

By comparing our algorithms with PEANUTS, we can see that DPColor, DPColorPath and DPTriPath are all consistently faster than PEANUTS on all datasets with varying k. On most datasets, DPColorPath is orders of magnitude faster than PEANUTS. For example, on DBLP, DPColor, DPColorPath and DPTriPath all take around 0.1 second, while PEANUTS consumes more than 1 second for most k values. In addition, on Orkut and Friendster, PEANUTS and DPColor cannot achieve the desired relative error within 5 hours for large k values, while DPColorPath and DPTriPath are still very efficient on these two datasets. For our algorithms, both DPColorPath and DPTriPath are generally faster than DPColor on large graphs. Moreover, the performance of DPColorPath and DPTriPath is much more stable than that of DPColor on all datasets. Additionally, we can also see that on Themaker and Friendster, DPTriPath is faster than DPColorPath. The reason is that DPTriPath can achieve a much higher clique density than DPColorPath, thus it needs far fewer samples to achieve a 0.1% relative error. Although DPTriPath has a higher complexity than DPColorPath for drawing a sample, it needs far fewer samples, thus it can be faster than DPColorPath. These results confirm our theoretical analysis in Sections 4, 5 and 6.

Exp 2: Relative errors with varying sample size. Fig. 4 shows the relative errors of the algorithms with varying sample size on Stanford and LiveJournal. Similar results can also be observed on the other datasets. As shown in Fig. 4, the relative error of DPColorPath is lower than those of DPColor and PEANUTS in most cases, and that of DPTriPath is lower still. When the sample size is $10^8$, the relative error of DPColorPath is slightly larger than that of DPColor on LiveJournal. This is because the sample size is large enough to make the relative errors of both DPColorPath and DPColor around $10^{-5}$. In general, the relative errors of all algorithms decrease as the sample size increases. Moreover, we can see that both DPColorPath and DPTriPath obtain a $10^{-5}$ relative error on all datasets when the sample size is $10^8$, indicating that DPColorPath and DPTriPath can achieve very high accuracy using a reasonable number of samples. These results further confirm the efficiency and effectiveness of our techniques.

Exp 3: Performance of different algorithms for a large k. In this experiment, we evaluate the performance of different sampling algorithms for a large k. We set the sample size as $5 \times 10^7$ on Orkut for all algorithms in Fig. 5(a). As can be seen, the error rates of all algorithms increase as k increases. DPColor and PEANUTS cannot obtain accurate and valid results for large k. Only DPTriPath can consistently achieve a relative error below 10% for large k, and DPTriPath consistently outperforms DPColorPath by at least one order of magnitude. This is because DPTriPath captures the clique property better than DPColorPath, and the clique density over the k-triangle paths is larger than that over the k-color paths. Fig. 5(b) compares the running time the algorithms need to bring the relative error below 10%. DPTriPath is more robust than DPColorPath when k becomes large. These results show the advantage of DPTriPath.

Exp 4: k-clique density. In this experiment, we evaluate the k-clique densities over the k-color sets ($\rho_c$), the k-color paths ($\rho_p$) and the k-triangle paths ($\rho_t$) in the dense regions of the graph, respectively. The results on all datasets are reported in Table 2. As expected, $\rho_c$ is no higher than $\rho_p$, and $\rho_p$ is no higher than $\rho_t$ on all datasets. Moreover, $\rho_c$, $\rho_p$ and $\rho_t$ all achieve a very high value on most datasets. For example, on DBLP, all of them are close to 100%. In general, they decrease as k increases. Nevertheless, on most datasets, $\rho_t$ remains very large even when k = 15. These results further confirm that the proposed techniques can achieve high accuracy on real-life graphs. Note that in Fig. 3(f) and Fig. 3(h), the exact algorithm PIVOTER is faster than the proposed DPColor algorithm. This is because the clique density over the k-color sets is almost zero on Orkut and Friendster, as shown in Table 2. According to Theorem 5, a large number of samples is needed to guarantee the 0.1% accuracy.

Exp 5: Memory overheads. Fig. 6 shows the memory usage of various algorithms on Themaker and LiveJournal for k = 8. The results for the other k values and datasets are consistent. As expected, the space consumption of PEANUTS is significantly higher than that of the other algorithms,


Fig. 6. Memory usage of various algorithms (k = 8). [Figure: memory (MB) of kClist, PIVOTER, PEANUTS, DPColor, DPColorPath and DPTriPath on Themaker and com-LiveJournal.]

TABLE 7
The performance of the adaptive sampling technique in Algorithm 7 for various clique densities (δ = 0.01).

Density   Parameters                       ε      T            Error
0.86      DPColor, k = 12, Google          0.01   320358       0.000
0.86      DPColor, k = 12, Google          0.05   7254         0.001
0.86      DPColor, k = 12, Google          0.1    2563         0.003
0.52      DPColorPath, k = 15, Friendster  0.01   273610       0.000
0.52      DPColorPath, k = 15, Friendster  0.05   20550        0.000
0.52      DPColorPath, k = 15, Friendster  0.1    10000        0.003
0.16      DPTriPath, k = 9, Orkut          0.01   1699744      0.002
0.16      DPTriPath, k = 9, Orkut          0.05   37527        0.002
0.16      DPTriPath, k = 9, Orkut          0.1    12945        0.066
0.06      DPTriPath, k = 15, Skitter       0.01   2464188      0.001
0.06      DPTriPath, k = 15, Skitter       0.05   103032       0.002
0.06      DPTriPath, k = 15, Skitter       0.1    33517        0.003
0.0067    DPColor, k = 12, Skitter         0.01   25089143     0.004
0.0067    DPColor, k = 12, Skitter         0.05   1672969      0.002
0.0067    DPColor, k = 12, Skitter         0.1    314110       0.009
0.0002    DPColorPath, k = 15, Orkut       0.01   1120373294   0.002
0.0002    DPColorPath, k = 15, Orkut       0.05   38053125     0.009
0.0002    DPColorPath, k = 15, Orkut       0.1    10222666     0.021

TABLE 5
The maximum clique size / max δv / $\lceil \sqrt{k\alpha + \frac{1}{2}} \rceil$ of the sparse regions of different graphs.

Networks      δ     k = 3      k = 8      k = 12     k = 15
DBLP          113   4/7/19     13/16/31   20/23/37   27/30/42
Skitter       111   8/11/19    20/23/30   27/30/37   35/38/41
Orkut         253   10/13/28   22/26/45   30/34/56   35/38/62
LiveJournal   360   8/11/33    19/22/54   24/28/66   28/31/74
Friendster    304   21/24/31   47/50/50   50/63/61   60/63/68
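The third value in each Table 5 cell, $\lceil \sqrt{k\alpha + \frac{1}{2}} \rceil$, can be recomputed from the table itself. The sketch below is illustrative and not part of the paper's code: it assumes that α equals the δ column of Table 5 (this section does not define α, so that identification is an assumption), and the names `clique_size_bound`, `TABLE5` and `mismatches` are hypothetical.

```python
import math


def clique_size_bound(k: int, alpha: int) -> int:
    """Evaluate ceil(sqrt(k * alpha + 1/2))."""
    return math.ceil(math.sqrt(k * alpha + 0.5))


# network -> (delta column, {k: third value of the Table 5 cell})
TABLE5 = {
    "DBLP": (113, {3: 19, 8: 31, 12: 37, 15: 42}),
    "Skitter": (111, {3: 19, 8: 30, 12: 37, 15: 41}),
    "Orkut": (253, {3: 28, 8: 45, 12: 56, 15: 62}),
    "LiveJournal": (360, {3: 33, 8: 54, 12: 66, 15: 74}),
    "Friendster": (304, {3: 31, 8: 50, 12: 61, 15: 68}),
}


def mismatches(table):
    """List (network, k, computed, tabulated) wherever the formula disagrees
    with the table, under the assumption alpha = delta."""
    bad = []
    for name, (delta, bounds) in table.items():
        for k, tabulated in bounds.items():
            computed = clique_size_bound(k, alpha=delta)
            if computed != tabulated:
                bad.append((name, k, computed, tabulated))
    return bad


if __name__ == "__main__":
    print(mismatches(TABLE5))
```

Under this assumption the list of mismatches is empty for all 20 tabulated values, which supports (but does not prove) reading α as the degeneracy δ in Table 5.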

as it needs to store the partial Turán Shadow structure. The space overheads of our algorithms and PIVOTER are comparable, while kClist consumes slightly more space than our algorithms. These results demonstrate that our algorithms are space efficient.

TABLE 8
Running time (in seconds) of DPColor, DPColorPath and DPTriPath with adaptive sample size (δ = 0.01, k = 24).

Datasets      ε      DPColor   DPColorPath   DPTriPath
Stanford      0.01   1.21      0.94          2.84
Stanford      0.05   0.44      0.45          1.76
Stanford      0.1    0.41      0.41          1.75
Skitter       0.01   INF       100.99        48.43
Skitter       0.05   INF       8.95          8.89
Skitter       0.1    INF       7.19          6.36
Orkut         0.01   INF       INF           INF
Orkut         0.05   INF       INF           822.02
Orkut         0.1    INF       INF           308.60
LiveJournal   0.01   26.25     27.52         172.16
LiveJournal   0.05   24.93     25.88         139.91
LiveJournal   0.1    24.90     25.85         138.29

Exp 6: Parallel performance of our algorithms. In this experiment, we evaluate the parallel performance of our algorithms. To this end, we implement parallel versions of DPColor, DPColorPath and DPTriPath using OpenMP. We fix the sample size at $5 \times 10^6$ and evaluate the runtime on the two largest datasets. The results are shown in Table 3. As can be seen, DPColor, DPColorPath and DPTriPath all achieve 12× to 14× speedups when using 16 threads. This result indicates a high degree of parallelism in our algorithms.
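The speedups reported in Exp 6 can be restated as parallel efficiency. The sketch below is not from the paper: it takes the reported 12× and 14× speedups on 16 threads and computes the efficiency (speedup divided by thread count) together with the Karp-Flatt experimentally determined serial fraction, a standard diagnostic for parallel programs. The function names are illustrative.

```python
def parallel_efficiency(speedup: float, threads: int) -> float:
    """Fraction of ideal linear speedup that is actually achieved."""
    return speedup / threads


def karp_flatt(speedup: float, threads: int) -> float:
    """Experimentally determined serial fraction (Karp-Flatt metric)."""
    if threads < 2:
        raise ValueError("the metric needs at least 2 threads")
    return (1.0 / speedup - 1.0 / threads) / (1.0 - 1.0 / threads)


if __name__ == "__main__":
    threads = 16  # thread count used in Exp 6
    for speedup in (12.0, 14.0):  # range of speedups reported in Exp 6
        eff = parallel_efficiency(speedup, threads)
        serial = karp_flatt(speedup, threads)
        print(f"speedup {speedup:.0f}x: efficiency {eff:.1%}, "
              f"serial fraction {serial:.2%}")
```

Read this way, the reported range corresponds to 75% to 87.5% efficiency, with an implied serial fraction of roughly 1% to 2%. This is only a back-of-the-envelope reading of the summary numbers, not a substitute for the per-dataset results in Table 3.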
Exp 7: The number of k-cliques in the sparse regions. In this experiment, we evaluate the number of k-cliques in the sparse regions of a graph on all datasets. Note that a node's neighborhood-induced subgraph is called a sparse region of a graph if the average degree of such a subgraph is smaller than k. Clearly, if the sparse regions contain fewer k-cliques, the PIVOTER algorithm should be more efficient. Table 4 reports our results on all datasets. As can be seen, for a relatively large k, the k-cliques in the sparse regions account for only a small portion of the total number of k-cliques on all datasets. On most datasets, this ratio does not exceed 0.1%. These results indicate that the proposed framework, which integrates both PIVOTER and sampling techniques, can be very efficient for handling real-life graphs.

Exp 8: Maximum clique size in the sparse regions. Table 5 shows the maximum clique size, the maximum degeneracy among the subgraphs, i.e. max δv where δv is the degeneracy of the subgraph $G(N_v(\vec{G}))$, and the value of $\lceil \sqrt{k\alpha + \frac{1}{2}} \rceil$ on the sparse regions of the graph (Theorem 1). Recall that the PIVOTER algorithm is based on the enumeration of maximal cliques, thus the maximum clique size bounds the recursion depth of PIVOTER. From Table 5, we can observe that the maximum clique in the sparse region of each graph is relatively small compared with the degeneracy δ of the entire graph, where the degeneracy value is the upper bound of the maximum clique size. Moreover, max δv is much smaller than δ, and $\lceil \sqrt{k\alpha + \frac{1}{2}} \rceil$ is also not very large, which further indicates the high effectiveness of the proposed solution.

Exp 9: Varying the threshold used to split the network. Table 6 shows the performance of DPColor, DPColorPath and DPTriPath under different threshold values for splitting the networks. In Table 6, the total running time increases and the relative error decreases in most cases as the threshold increases. For example, on Orkut, the total running time of DPTriPath is 239.8s, 251.5s and 310.6s, and the relative error is 0.25%, 0.20% and 0.20%, for the thresholds 0.5k, k and 2k, respectively. However, the results under different threshold values differ by no more than an order of magnitude. According to these results, we can conclude that our algorithm is not very sensitive to the threshold value.

Exp 10: Results with adaptive sample size. Table 7 shows the performance of Algorithm 7 over different clique densities. The value of δ is set to 0.01. In Table 7, the columns are (1) clique density, (2) the sampler, the value of k and the network,


TABLE 6
The effect of different thresholds to split the network (k = 8, t = $5 \times 10^6$). Time cells give Exact/Sampling/Total in seconds; Error and Clique density cells give the values for the thresholds 0.5k/k/2k.

Datasets   Algs          Time (s), 0.5k     Time (s), k        Time (s), 2k       Error (%)           Clique density
Skitter    DPColor       2.7/0.6/3.3        3.6/0.5/4.2        5.3/0.5/5.8        0.47/0.29/0.29      0.05/0.05/0.06
Skitter    DPColorPath   2.7/1.0/3.6        3.6/0.7/4.3        5.3/0.4/5.8        0.17/0.13/0.10      0.26/0.26/0.26
Skitter    DPTriPath     2.7/3.5/6.2        3.6/3.0/6.6        5.3/2.4/7.7        0.14/0.10/0.04      0.48/0.48/0.48
Google     DPColor       0.3/0.5/0.7        0.5/0.4/0.8        0.5/0.3/0.9        0.14/0.04/0.02      0.90/0.91/0.92
Google     DPColorPath   0.3/0.4/0.6        0.5/0.2/0.7        0.5/0.2/0.7        0.14/0.04/0.02      0.94/0.94/0.94
Google     DPTriPath     0.3/0.7/1.0        0.5/0.4/0.9        0.5/0.2/0.8        0.06/0.02/0.02      0.96/0.96/0.95
Orkut      DPColor       108.2/4.1/112.4    138.8/2.3/141.1    224.4/1.2/225.6    97.92/93.95/93.89   0.00/0.00/0.00
Orkut      DPColorPath   108.1/50.2/158.3   138.6/37.5/176.1   224.9/22.3/247.1   5.38/5.27/3.67      0.03/0.03/0.03
Orkut      DPTriPath     109.0/130.8/239.8  136.6/114.9/251.5  224.6/86.0/310.6   0.25/0.20/0.20      0.23/0.23/0.23

(3) the value of the error bound ε in Algorithm 7, (4) the total sample size, i.e. the value of T in Algorithm 7, and (5) the estimate error. As shown in Table 7, regardless of the density, the estimate error is consistently smaller than the given expected error bound ε. The value of T tends to become larger when the clique density and the error bound ε become smaller. These results are consistent with Theorem 13, which confirms that Algorithm 7 can achieve a good accuracy guarantee.

Table 8 shows the running time of DPColor, DPColorPath and DPTriPath equipped with Algorithm 7 when k = 24. "INF" means that the adaptive sample size exceeds $10^{10}$. In Table 8, DPColor and DPColorPath are faster than DPTriPath on Stanford and LiveJournal, and slower on Skitter and Orkut. This is because the clique density differs on these datasets. For the settings of Table 8, ρt is much larger than ρc and ρp on Skitter and Orkut, whereas the three densities are similar on Stanford and LiveJournal. For example, ρc = 0.00002, ρp = 0.001 and ρt = 0.005 on Skitter, while ρc = 0.37, ρp = 0.46 and ρt = 0.88 on Stanford. These results further confirm the analysis in Section 6.3.

9 FURTHER RELATED WORK

K-clique and triangle counting. Besides the practical algorithms introduced above, there also exist some theoretical studies on the k-clique counting problem [37], [38], [39], [40]. Most of these theoretical works focus mainly on devising algorithms with a better worst-case time complexity. The practical performance of such algorithms is often much worse than that of the state-of-the-art practical algorithms [16]. A triangle is a k-clique with k = 3. The problem of counting triangles in a graph has a long history. There are many algorithms in the literature [31], [41], [42], [43], [44]. For example, both [41] and [42] are ordering-based exact triangle counting algorithms. Chu and Cheng [43] developed an I/O-efficient exact algorithm for triangle listing. Tsourakakis et al. [31] proposed an edge sampling algorithm to approximate the number of triangles in a graph. Becchetti et al. [44] presented an approximate triangle counting algorithm in the semi-streaming model. Tom et al. [45] and Hu et al. [46] developed efficient parallel algorithms for triangle counting on shared-memory platforms and GPUs, respectively.

Motif counting. Many exact and sampling-based approximation algorithms have been proposed for motif counting [18], [23], [26], [35], [47], [48], and some of them can also be used to count k-cliques. Notable examples include the color coding based algorithms [23], [26] and the edge sampling based algorithms [18]. However, as shown in [20], these algorithms cannot scale to large graphs, and their practical performance is worse than that of TuranShadow.

10 CONCLUSION

In this paper, we propose a time and space efficient framework for k-clique counting. Our framework first divides the graph into sparse and dense regions based on the average degree. Then, for the sparse regions, we use the state-of-the-art PIVOTER algorithm to compute the exact number of k-cliques. For the dense regions, we develop three novel DP-based sampling techniques, namely k-color set, k-color path and k-triangle path sampling, to estimate the k-clique count. Extensive experiments on 8 real-life graphs show that our algorithms are very efficient and accurate and also use less space than the state-of-the-art algorithms.

ACKNOWLEDGMENTS

This work was partially supported by (i) National Key Research and Development Program of China 2020AAA0108503, (ii) NSFC Grants U2241211, 62072034, and (iii) CCF-Huawei Populus Grove Fund.

REFERENCES

[1] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, “Network motifs: Simple building blocks of complex networks,” Science, vol. 298, no. 5594, pp. 763–764, 2010.
[2] S. R. Burt, “Structural holes and good ideas,” American Journal of Sociology, vol. 110, no. 2, pp. 349–399, 2004.
[3] K. Faust, “A puzzle concerning triads in social networks: Graph constraints and the triad census,” Soc. Networks, vol. 32, no. 3, pp. 221–233, 2010.
[4] N. Przulj, D. G. Corneil, and I. Jurisica, “Modeling interactome: scale-free or geometric?” Bioinform., vol. 20, no. 18, pp. 3508–3515, 2004.
[5] C. Seshadhri and S. Tirthapura, “Scalable subgraph counting: The methods behind the madness,” in WWW, 2019.
[6] J. W. Berry, B. Hendrickson, R. A. Laviolette, and C. A. Phillips, “Tolerating the community detection resolution limit with edge weighting,” Physical Review E Statistical Nonlinear & Soft Matter Physics, vol. 83, no. 5, p. 056119, 2011.
[7] B. Sun, M. Danisch, T. H. Chan, and M. Sozio, “Kclist++: A simple algorithm for finding k-clique densest subgraphs in large graphs,” Proc. VLDB Endow., vol. 13, no. 10, pp. 1628–1640, 2020.
[8] C. E. Tsourakakis, “The k-clique densest subgraph problem,” in WWW, 2015.
[9] A. E. Sariyüce, C. Seshadhri, A. Pinar, and Ü. V. Çatalyürek, “Finding the hierarchy of dense subgraphs using nucleus decompositions,” in WWW, 2015.


[10] A. R. Benson, D. F. Gleich, and J. Leskovec, “Higher-order organization of complex networks,” Science, vol. 353, no. 6295, 2016.
[11] H. Yin, A. R. Benson, and J. Leskovec, “Higher-order clustering in networks,” Physical Review E, vol. 97, no. 5, p. 052306, 2017.
[12] N. Chiba and T. Nishizeki, “Arboricity and subgraph listing algorithms,” SIAM J. Comput., vol. 14, no. 1, pp. 210–223, 1985.
[13] I. Finocchi, M. Finocchi, and E. G. Fusco, “Clique counting in mapreduce: Algorithms and experiments,” ACM J. Exp. Algorithmics, vol. 20, pp. 1.7:1–1.7:20, 2015.
[14] K. Makino and T. Uno, “New algorithms for enumerating all maximal cliques,” in 9th Scandinavian Workshop on Algorithm Theory, 2004.
[15] M. Danisch, O. Balalau, and M. Sozio, “Listing k-cliques in sparse real-world graphs,” in WWW, 2018.
[16] R. Li, S. Gao, L. Qin, G. Wang, W. Yang, and J. X. Yu, “Ordering heuristics for k-clique listing,” Proc. VLDB Endow., vol. 13, no. 11, pp. 2536–2548, 2020.
[17] S. Jain and C. Seshadhri, “The power of pivoting for exact clique counting,” in WSDM, 2020.
[18] M. Rahman, M. A. Bhuiyan, and M. A. Hasan, “Graft: An efficient graphlet counting method for large graph analysis,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 10, pp. 2466–2478, 2014.
[19] N. Alon, R. Yuster, and U. Zwick, “Color-coding: a new method for finding simple paths, cycles and other small subgraphs within large graphs,” in STOC, 1994.
[20] S. Jain and C. Seshadhri, “A fast and provable method for estimating clique counts using Turán’s theorem,” in WWW, 2017.
[21] D. W. Matula and L. L. Beck, “Smallest-last ordering and clustering and graph coloring algorithms,” J. ACM, vol. 30, no. 3, pp. 417–427, 1983.
[22] E. Tomita, A. Tanaka, and H. Takahashi, “The worst-case time complexity for generating all maximal cliques and computational experiments,” Theor. Comput. Sci., vol. 363, no. 1, pp. 28–42, 2006.
[23] M. Bressan, S. Leucci, and A. Panconesi, “Motivo: Fast motif counting via succinct color coding and adaptive sampling,” Proc. VLDB Endow., vol. 12, no. 11, pp. 1651–1663, 2019.
[24] M. Jha, C. Seshadhri, and A. Pinar, “Path sampling: A fast and provable method for estimating 4-vertex subgraph counts,” in WWW, 2015.
[25] P. Wang, J. Zhao, X. Zhang, Z. Li, J. Cheng, J. C. S. Lui, D. Towsley, J. Tao, and X. Guan, “MOSS-5: A fast method of approximating counts of 5-node graphlets in large graphs,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 1, pp. 73–86, 2018.
[26] M. Bressan, F. Chierichetti, R. Kumar, S. Leucci, and A. Panconesi, “Motif counting beyond five nodes,” ACM Trans. Knowl. Discov. Data, vol. 12, no. 4, pp. 48:1–48:25, 2018.
[27] S. Jain and C. Seshadhri, “Provably and efficiently approximating near-cliques using the Turán shadow: PEANUTS,” in WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020, 2020, pp. 1966–1976.
[28] B. Balasundaram and S. Butenko, “Graph domination, coloring and cliques in telecommunications,” in Handbook of Optimization in Telecommunications. Springer, 2006, pp. 865–890.
[29] L. Chang and L. Qin, “Cohesive subgraph computation over large sparse graphs,” in ICDE, 2019.
[30] V. Batagelj and M. Zaversnik, “An o(m) algorithm for cores decomposition of networks,” CoRR, vol. cs.DS/0310049, 2003.
[31] C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos, “DOULION: counting triangles in massive graphs with a coin,” in KDD, 2009.
[32] W. Hasenplaugh, T. Kaler, T. B. Schardl, and C. E. Leiserson, “Ordering heuristics for parallel graph coloring,” in SPAA, 2014.
[33] L. Yuan, L. Qin, X. Lin, L. Chang, and W. Zhang, “Effective and efficient dynamic graph coloring,” Proc. VLDB Endow., vol. 11, no. 3, pp. 338–351, 2017.
[34] L. Li, “Discrete distributions,” 1972.
[35] P. Ribeiro, P. Paredes, M. E. P. Silva, D. Aparício, and F. M. A. Silva, “A survey on subgraph counting: Concepts, algorithms, and applications to network motifs and graphlets,” ACM Comput. Surv., vol. 54, no. 2, pp. 28:1–28:36, 2022.
[36] M. Almasri, I. E. Hajj, R. Nagi, J. Xiong, and W. Hwu, “Parallel k-clique counting on gpus,” in ICS, 2022.
[37] T. Eden, D. Ron, and C. Seshadhri, “On approximating the number of k-cliques in sublinear time,” in STOC, 2018.
[38] T. Eden, D. Ron, and C. Seshadhri, “Faster sublinear approximation of the number of k-cliques in low-arboricity graphs,” in SODA, 2020.
[39] K. Censor-Hillel, Y. Chang, F. L. Gall, and D. Leitersdorf, “Tight distributed listing of cliques,” in SODA, 2021.
[40] L. Gianinazzi, M. Besta, Y. Schaffner, and T. Hoefler, “Parallel algorithms for finding large cliques in sparse graphs,” in SPAA, 2021.
[41] M. Latapy, “Main-memory triangle computations for very large (sparse (power-law)) graphs,” Theor. Comput. Sci., vol. 407, no. 1-3, pp. 458–473, 2008.
[42] M. Ortmann and U. Brandes, “Triangle listing algorithms: Back from the diversion,” in ALENEX, 2014.
[43] S. Chu and J. Cheng, “Triangle listing in massive networks and its applications,” in KDD, 2011.
[44] L. Becchetti, P. Boldi, C. Castillo, and A. Gionis, “Efficient semi-streaming algorithms for local triangle counting in massive graphs,” in KDD, 2008.
[45] A. S. Tom, N. Sundaram, N. K. Ahmed, S. Smith, S. Eyerman, M. Kodiyath, I. Hur, F. Petrini, and G. Karypis, “Exploring optimizations on shared-memory platforms for parallel triangle counting algorithms,” in HPEC, 2017.
[46] L. Hu, L. Zou, and Y. Liu, “Accelerating triangle counting on GPU,” in SIGMOD, 2021.
[47] N. Pashanasangi and C. Seshadhri, “Efficiently counting vertex orbits of all 5-vertex subgraphs, by EVOKE,” in WSDM, 2020.
[48] A. Pinar, C. Seshadhri, and V. Vishal, “ESCAPE: efficiently counting all 5-vertex subgraphs,” in WWW, 2017.

Xiaowei Ye received the BE degree from Shandong University, China, in 2021, and is working toward the PhD degree at Beijing Institute of Technology (BIT), Beijing, China. His research interests include subgraph counting, graph data mining and social network analysis.

Rong-Hua Li received the PhD degree from the Chinese University of Hong Kong, in 2013. He is currently a professor with the Beijing Institute of Technology (BIT), Beijing, China. Before joining BIT in 2018, he was an assistant professor with Shenzhen University. His research interests include graph data management and mining, social network analysis, graph computation systems, and graph-based machine learning.

Qiangqiang Dai is working toward the PhD degree at Beijing Institute of Technology (BIT), Beijing, China. His research interests include graph data management and mining, social network analysis, and graph computation systems.

Hongzhi Chen received his Ph.D. degree from the Department of Computer Science and Engineering, the Chinese University of Hong Kong, in 2020. He is currently a senior R.D. at ByteDance Infrastructure Team, Beijing, China, working on graph related storage, processing and training systems. His research interests cover the broad area of distributed systems and databases, with special emphasis on graph systems and machine learning/deep learning systems.

Guoren Wang received the BS, MS, and PhD degrees from the Department of Computer Science, Northeastern University, China, in 1988, 1991, and 1996, respectively. Currently, he is a professor with the Beijing Institute of Technology (BIT), Beijing, China. His research interests include graph data management and mining, query processing and optimization, and graph computation systems.
