GUISE: Uniform Sampling of Graphlets For Large Graph Analysis
[Figure: the 29 graphlet types, g1 through g29 (3-, 4- and 5-node graphlets), shown in panels (a), (b) and (c).]
of the frequencies; thus each entry in the GFD becomes log [ (f(i) + 1) / (Σ_{i=1}^{29} f(i) + 29) ].
Example: In the graph in Figure 3(a), the frequencies of each type of graphlet are 11, 3, 5, 3, 0, 6, 2, 0, 1, 2, 0, 2, 1, 2, 0, 0, 2, 0, 1, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0. Here, 11 is the count of g1, 3 is the count of g2, and so on. Using the process discussed above, the GFD of this graph is: −0.8, −1.3, −1.1, −1.3, −1.9, −1.0, −1.4, −1.9, −1.6, −1.4, −1.9, −1.4, −1.6, −1.4, −1.9, −1.9, −1.4, −1.9, −1.6, −1.9, −1.9, −1.9, −1.9, −1.2, −1.9, −1.9, −1.9, −1.9, −1.9.
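The GFD computation just described can be sketched in a few lines. The base of the logarithm is not stated in this excerpt; base 10 is an assumption, but it reproduces the printed values:

```python
import math

# Graphlet frequencies f(1)..f(29) from the example above (Figure 3(a)).
freqs = [11, 3, 5, 3, 0, 6, 2, 0, 1, 2, 0, 2, 1, 2, 0, 0,
         2, 0, 1, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0]

# Each GFD entry is log((f(i) + 1) / (sum_j f(j) + 29)); the +1 keeps
# zero-frequency graphlets finite (the ratio is never zero).
denom = sum(freqs) + 29                       # = 74 for this example
gfd = [round(math.log10((f + 1) / denom), 1) for f in freqs]
print(gfd[:4])  # → [-0.8, -1.3, -1.1, -1.3]
```

Rounded to one decimal place, the 29 values match the GFD listed above entry by entry.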
D. Markov Chains

Consider a random variable X whose values (states) are defined over a state space S, and let Xt denote the value (state) of X at (discrete) time t. This random variable is called a Markov process if the transition probability between a pair of states in S depends only on the current value (state) of X. A Markov chain is the sequence of a Markov process over the state space S. The transition probabilities can be expressed in a matrix T, called the transition probability matrix. Each state in S occupies exactly one row and one column of T, in which the entry T(i, j) is the probability of transition from state i to state j. For all i, j ∈ S, we have 0 ≤ T(i, j) ≤ 1, and Σ_j T(i, j) = 1.

A Markov chain is said to reach a stationary distribution π when the probability of being in any particular state is independent of the initial condition. This scenario is indicated by the condition

π = πT    (1)

where π is a row vector of size |S|. Thus, the stationary distribution is the left eigenvector of the matrix T with an eigenvalue of 1. We use π(i) to denote the i'th component of this vector. A Markov chain is reversible if it satisfies the reversibility condition below:

π(i)T(i, j) = π(j)T(j, i), ∀i, j ∈ S    (2)

The above condition is a sufficient condition for π to be a stationary distribution of the Markov chain. A Markov chain is ergodic if it converges to a stationary distribution.

III. METHOD

As explained in Section I, a naive approach to generate the GFD is to count the frequency of each graphlet, which requires enumeration of all distinct induced embeddings. This task becomes infeasible when the input graph is large. We propose an efficient method that utilizes uniform sampling to approximate the GFD. Below, we discuss the method in detail.

A. Uniform Sampling of Graphlets for GFD Construction

Given a graph G, assume the set S contains all the (induced) embeddings of all the graphlets in G. Then |S| = Σ_{i=1}^{29} f(i), where f(i) is the frequency of graphlet i in G. The task of uniform sampling of graphlets is then to sample one of the graphlet embeddings in S uniformly at random; i.e., the selection probability of each graphlet embedding is exactly 1/|S|. The task is no harder than the enumeration of all the graphlets in S: after enumerating all the graphlet embeddings, we only need a random number generator to sample one of them uniformly. But enumerating all the graphlets is not practical, so we want to sample a graphlet uniformly without explicitly enumerating all the embeddings of all the graphlets, which is a challenging task. Fortunately, problems of this kind have long been handled efficiently by Markov Chain Monte Carlo (MCMC) algorithms. MCMC algorithms perform a random walk on the sample space with a locally computable transition probability matrix, in such a manner that the stationary distribution of the random walk aligns with the desired probability distribution.
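Equations (1) and (2) can be checked numerically on a toy chain. The 4-state neighbor graph below is an assumed example, not from the paper; its transition matrix uses the min(1/d(i), 1/d(j)) rule with self-loops absorbing the leftover probability, the construction GUISE later applies to graphlet embeddings:

```python
from fractions import Fraction

# Assumed toy neighbor graph over states 0..3 (stand-ins for embeddings).
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4
nbrs = {i: set() for i in range(n)}
for a, b in edges:
    nbrs[a].add(b)
    nbrs[b].add(a)
d = {i: len(nbrs[i]) for i in range(n)}

# T(i, j) = min(1/d(i), 1/d(j)) for neighbors; leftover mass -> self-loop.
T = [[Fraction(0)] * n for _ in range(n)]
for i in range(n):
    for j in nbrs[i]:
        T[i][j] = min(Fraction(1, d[i]), Fraction(1, d[j]))
    T[i][i] = 1 - sum(T[i])

# Symmetric, hence doubly stochastic: every row AND column sums to 1.
assert all(sum(row) == 1 for row in T)
assert all(sum(T[i][j] for i in range(n)) == 1 for j in range(n))

# Equation (1): the uniform row vector pi is stationary, pi = pi T.
pi = [Fraction(1, n)] * n
assert [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)] == pi
# Equation (2): reversibility, pi(i) T(i, j) == pi(j) T(j, i).
assert all(pi[i] * T[i][j] == pi[j] * T[j][i]
           for i in range(n) for j in range(n))
print("uniform distribution is stationary")
```

Exact rational arithmetic (`Fraction`) makes the stationarity and reversibility checks equalities rather than floating-point approximations.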
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY ROORKEE. Downloaded on May 22,2024 at 09:55:24 UTC from IEEE Xplore. Restrictions apply.
Once the random walk mixes, any object that the walk visits in the sample space is considered to be a sample taken using the desired probability distribution. For our task, the sample space is the set S, and the desired probability distribution is the iid (uniform) distribution.

[Figure: an example graph and associated neighbor lists (0: 1,2,3,4; 1: 5,6,7,8,9,10; 2: 5,6,7,8,10; 3: 5,6,7,8,9,10).]

Before we discuss the details of the MCMC method for iid
the probability of moving from the state i to the state j and the probability of moving from the state j to the state i are equal. GUISE adopts this strategy, i.e., it uses a symmetric transition probability matrix T, so that T = T^T.

For a graphlet i, its degree d(i) is defined as the total number of its neighbors in the random walk space. The term degree has an intuitive meaning: if we consider each graphlet embedding as a vertex of a graph, and represent the neighbor relationship between two graphlet embeddings as an edge, then the degree of an embedding is exactly its degree in that graph. In that case, the random walk can be viewed as a walk along the edges of the graph. Now, consider two neighboring graphlets, p and q. Setting T(p, q) and T(q, p) equal to min(1/d(p), 1/d(q)) makes T a symmetric matrix (note that if p and q are not neighbors, T(p, q) = T(q, p) = 0, which maintains the symmetry). By definition, the entries of each row of T must sum to 1; the above setting ensures that the sum of the entries of every row is at most 1. In case it is less than 1, we allocate the remaining probability as a self-loop on the resident state. Such a symmetric transition probability matrix is also called doubly stochastic, as both its rows and its columns sum to 1. Now the following lemma (Exercise 6.9 of [18]) is sufficient to prove that the above Markov chain achieves a uniform stationary distribution.

Lemma 2: An ergodic random walk achieves a uniform stationary distribution if and only if its transition probability matrix is doubly stochastic.

PROOF: According to Equation 1 we have

π = πT    (3)

which, for uniform π, means that the entries of each column of T sum to 1, i.e., T is column stochastic. Moreover, T is a transition probability matrix and is therefore row stochastic; thus T is doubly stochastic. Since an ergodic random walk has a unique stationary distribution, both the necessary and the sufficient conditions for the proof are met.

Nevertheless, we still need to prove the following:

Lemma 3: The random walk that GUISE uses is ergodic.

PROOF: A Markov chain is ergodic if it converges to a stationary distribution. To have a stationary distribution, the random walk needs to be finite, irreducible and aperiodic [18]. The state space S is finite with size m, because the number of graphlets is finite. We also assume that the input graph G is connected, so in this random walk any state y is reachable from any state x with positive probability and vice versa; hence the random walk is irreducible. Finally, the walk can be made aperiodic by allocating a self-loop probability at every node.^7 Thus the lemma is proved.

3) Sampling Convergence: Convergence is an important issue for any MCMC-sampling-based algorithm. The number of iterations the random process needs to reach its stationary distribution is referred to as the mixing time of the random walk. The smaller the mixing time, the better the sampling quality. The mixing time can be estimated by computing the spectral gap [19] of the transition probability matrix T. It is also related to the diameter of the transition graph. Since most social network graphs have a very small diameter, the diameter of the transition graph is also small, so the mixing time of our random walk is small and the sampling process converges very fast. We skip a detailed discussion for the sake of brevity.

^7 This is required only from a theoretical standpoint; in our experiments we do not allocate any self-loop probability unless needed.
GUISE (G, SCount):
 1. gx = get_a_initial_graphlet(G)
 2. d_gx = populate_neighborhood(gx)
 3. sampled = 0
 4. while (true)
 5.     choose a neighbor gy uniformly from all possible neighbors
 6.     d_gy = populate_neighborhood(gy)
 7.     acceptance_probability = min(|d_gx| / |d_gy|, 1)
 8.     if uniform(0, 1) <= acceptance_probability
 9.         gx = gy
10.         d_gx = d_gy
11.     sampled += 1
12.     get_graphlet_type(gx) += 1
13.     if (sampled > SCount)
14.         return

populate_neighborhood(gx):
 1. neighbor_list = generate all potential neighboring graphlets
 2. return neighbor_list

Figure 5: Uniform Sampling Algorithm

ability min(|d_gx|/|d_gy|, 1). So, the transition probability T(gx, gy) = min(1/|d_gx|, 1/|d_gy|).
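The walk in Figure 5 can be sketched in Python. The three callables are stand-ins for the pseudocode's helpers (initial graphlet, neighborhood population, graphlet typing), so only the Metropolis-Hastings control flow is fixed here; this is a sketch, not the paper's implementation:

```python
import random
from collections import Counter

def guise_walk(initial, neighborhood, graphlet_type, scount, seed=None):
    """Sketch of Figure 5: a Metropolis-Hastings walk over embeddings.

    initial() returns a starting embedding, neighborhood(g) the list of
    neighboring embeddings, and graphlet_type(g) the type to be counted.
    """
    rng = random.Random(seed)
    counts = Counter()
    gx = initial()
    d_gx = neighborhood(gx)
    for _ in range(scount):
        gy = rng.choice(d_gx)                      # uniform proposal
        d_gy = neighborhood(gy)
        accept = min(len(d_gx) / len(d_gy), 1.0)   # min(|d_gx|/|d_gy|, 1)
        if rng.random() <= accept:
            gx, d_gx = gy, d_gy
        counts[graphlet_type(gx)] += 1             # every visit counts
    return counts
```

Note that the counter is incremented on every iteration, including rejected moves; counting the resident state again on rejection is what makes the normalized counts converge to the walk's (uniform) stationary distribution.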
IV. IMPLEMENTATION DETAILS

GUISE accepts a connected graph G for which it computes the GFD. It starts from a random graphlet embedding (say gt); this can simply be an embedding of a g1 graphlet, which is easy to get. Then it computes the transition probability matrix T locally, which requires knowledge of the degree of gt; it is also important to know the graphlet type of gt, so that the correct counter can be incremented.
A. Populating the neighborhood of a graphlet

Populating the neighborhood of a graphlet is the most time-consuming task. In the following we explain how GUISE populates the neighborhood of a 4-Graphlet gt. To obtain a 3-Graphlet, GUISE first deletes one of the vertices of gt and checks whether the remaining 3 vertices are still connected in the input graph. If yes, a 3-Graphlet neighbor of gt is obtained. Note that gt can have (at most) 4 such neighboring graphlets. To obtain all neighboring 4-Graphlets of gt, GUISE first removes one of the vertices of gt and checks whether the remaining 3 vertices are still connected. If this succeeds, it finds the union of the adjacent vertices of these 3 vertices. Each vertex of the resulting set, together with the 3 undeleted vertices (of gt), represents a neighbor of gt. The process of removing and combining is repeated for all the vertices of gt. Finally, to get all neighboring 5-Graphlets of gt, GUISE takes the union of the adjacency lists of all 4 of its vertices, picks a vertex from the union set, and combines it with gt.

Following the above techniques, GUISE can populate the neighborhood of size-3, size-4 and size-5 graphlets.

B. Identifying graphlet type

To identify the type of a graphlet gt, GUISE first treats gt as a graph g(v, e), where the cardinality of v is between 3 and 5 and the cardinality of e is between 2 and 10. First of all, graphlets can be categorized based on the cardinality of v. To distinguish graphlets within each category, we introduce a signature for each graphlet based on the degree of each vertex in g(v, e); we call this signature the degree-signature. We first compute the degree of each vertex of the graph g(v, e) and save it in a vector of size |v|. We then sort the vector and use this sorted vector as the signature of the graphlet. GUISE computes this signature and, based on it, identifies which type of graphlet gt is. It is possible for two distinct types of graphlets to have the same degree-signature; in that case GUISE checks additional criteria to distinguish them. Note that this scenario occurs only for two pairs of graphlets, (g13, g16) and (g20, g21).

Example: Let us take three graphlets, g12, g22 and g26, all of size 5. For g12 we can easily compute the degrees of all vertices, save them in a vector and, after sorting the vector, obtain the degree-signature of g12, which is (1, 1, 2, 3, 3). In a similar fashion we can get the degree-signatures of g22 and g26, which are (2, 2, 2, 4, 4) and (2, 3, 3, 4, 4), respectively. As we can see, the degree-signatures are unique for all three graphlets. A similar trend holds for the other graphlets, except the two pairs mentioned above. For g13 and g16, both have (1, 2, 2, 2, 3) as their degree-signature. We can easily distinguish them if we drop down to their structural level: in g13 the minimum-degree vertex has only a neighbor of degree 2, while in g16 it has a neighbor of degree 3. Using a similar trick, we can distinguish g20 and g21.

C. Complexity analysis

The most expensive part of GUISE is the neighborhood computation, for which we need to perform union operations on the adjacency lists of the current graphlet. Our assumption that the adjacency lists are stored in sorted order allows us to perform a union operation in time linear in the lengths of the participating lists; for example, the cost of performing a union over two adjacency lists of sizes m and n is O(m + n).

The worst-case time for finding the neighbors of a 3-Graphlet is O(9p) (3 · O(2p) + O(3p)), where p is the average length of the adjacency lists. Similarly, for a 4-Graphlet and a 5-Graphlet the time complexities are O(24p) (4 · O(2p) + 4 · O(3p) + 2 · O(2p)) and O(30p) (5 · O(2p) + 5 · O(4p)), respectively. The total execution time over all iterations is O(9p) · |3-Graphlets| + O(24p) · |4-Graphlets| + O(30p) · |5-Graphlets|, where |k-Graphlets| denotes the number of k-Graphlet embeddings sampled.
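The degree-signature from Section B above is straightforward to compute. The 5-vertex edge list below is a hypothetical embedding chosen so that its signature comes out to (1, 1, 2, 3, 3); it is not taken from the paper's figures:

```python
def degree_signature(vertices, edges):
    """Sorted degree vector of a graphlet g(v, e), used to identify its type."""
    deg = {v: 0 for v in vertices}
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return tuple(sorted(deg.values()))

# A triangle {0, 1, 2} with pendant vertices 3 (on 0) and 4 (on 1).
sig = degree_signature(range(5), [(0, 1), (0, 2), (1, 2), (0, 3), (1, 4)])
print(sig)  # → (1, 1, 2, 3, 3)
```

For the two ambiguous pairs, (g13, g16) and (g20, g21), the signature alone is not enough, and the structural check on the neighbors of the minimum-degree vertex described above breaks the tie.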