
[Figure 2: All 3-, 4-, and 5-node graphlets (g1 through g29)]

[Figure 3: sub-graph and vertex-induced sub-graph; panel (a) is the example graph referenced below]

of the frequencies; thus each entry in the GFD becomes

    log ( (f(i) + 1) / (Σ_{i=1}^{29} f(i) + 29) )

Example: In the graph in Figure 3(a), the frequencies of the graphlet types are 11, 3, 5, 3, 0, 6, 2, 0, 1, 2, 0, 2, 1, 2, 0, 0, 2, 0, 1, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0. Here, 11 is the count of g1, 3 is the count of g2, and so on. Using the process discussed above, the GFD of this graph is: (−0.8, −1.3, −1.1, −1.3, −1.9, −1, −1.4, −1.9, −1.6, −1.4, −1.9, −1.4, −1.6, −1.4, −1.9, −1.9, −1.4, −1.9, −1.6, −1.9, −1.9, −1.9, −1.9, −1.2, −1.9, −1.9, −1.9, −1.9, −1.9).
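The construction can be made concrete with a short sketch (not from the paper). It assumes a base-10 logarithm, which reproduces the rounded values of the example above; the list freq is the frequency vector of the example graph.

    import math

    # Graphlet frequencies f(1)..f(29) of the example graph in Figure 3(a).
    freq = [11, 3, 5, 3, 0, 6, 2, 0, 1, 2, 0, 2, 1, 2, 0,
            0, 2, 0, 1, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0]

    def gfd(freq):
        """GFD: logarithm of the +1-smoothed, normalized graphlet frequencies."""
        denom = sum(freq) + len(freq)                    # Σ f(i) + 29
        return [math.log10((f + 1) / denom) for f in freq]

    print([round(x, 1) for x in gfd(freq)])
    # [-0.8, -1.3, -1.1, -1.3, -1.9, -1.0, -1.4, ...]    (matches the example GFD)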
D. Markov Chains

Consider a random variable X whose values (states) are defined in a state space S, and let Xt denote the value (state) of X at (discrete) time t. This random variable is called a Markov process if the transition probability between any pair of states in S depends only on the current value (state) of X. A Markov chain is the sequence of a Markov process over the state space S. The transition probabilities can be expressed in a matrix T, called the transition probability matrix. Each state in S occupies exactly one row and one column of T, in which the entry T(i, j) is the probability of transition from state i to state j. For all i, j ∈ S, we have 0 ≤ T(i, j) ≤ 1, and Σ_j T(i, j) = 1.

A Markov chain is said to reach a stationary distribution π when the probability of being in any particular state is independent of the initial condition. This scenario is indicated by the condition

    π = πT    (1)

π is a row vector of size |T|. Thus, the stationary distribution is the left eigenvector of the matrix T with an eigenvalue of 1. We use π(i) to denote the i'th component of this vector. A Markov chain is reversible if it satisfies the reversibility condition below:

    π(i)T(i, j) = π(j)T(j, i), ∀i, j ∈ S    (2)

The above condition is a sufficient condition for π to be a stationary distribution of the Markov chain. A Markov chain is ergodic if it converges to a stationary distribution.
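As an illustration of conditions (1) and (2), the following small NumPy sketch (not from the paper) builds a symmetric three-state chain, which is reversible and has a uniform stationary distribution, and verifies both conditions numerically.

    import numpy as np

    # A small symmetric transition matrix (each row sums to 1).
    T = np.array([[0.5, 0.3, 0.2],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.3, 0.5]])

    # Approach the stationary distribution by iterating pi <- pi T.
    pi = np.full(3, 1 / 3)
    for _ in range(1000):
        pi = pi @ T

    print(np.allclose(pi, pi @ T))                        # Equation (1): pi = pi T
    # Equation (2): pi(i) T(i, j) == pi(j) T(j, i) for all i, j.
    print(np.allclose(pi[:, None] * T, (pi[:, None] * T).T))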
III. METHOD

As explained in Section I, a naive approach to generate the GFD is to count the frequency of each graphlet, which requires enumeration of all distinct induced embeddings. This task becomes infeasible when the input graph is large. We propose an efficient method that utilizes uniform sampling to approximate the GFD. Below, we discuss the method in detail.

A. Uniform Sampling of Graphlets for GFD Construction

Given a graph G, assume the set S contains all the (induced) embeddings of all the graphlets in the graph G. Then |S| = Σ_{i=1}^{29} f(i), where f(i) is the frequency of graphlet i in the graph G. The task of uniform sampling of graphlets is to sample one of the graphlet embeddings in S uniformly at random; i.e., the selection probability of each of the graphlet embeddings is exactly 1/|S|. The task is no harder than the enumeration of all the graphlets in S: after enumerating all the graphlet embeddings, we would only need a random number generator to sample one of those embeddings from an iid distribution. But enumerating all the graphlets is not practical, so we want to sample a graphlet uniformly without explicitly enumerating all the embeddings of all the graphlets, which is a challenging task. Fortunately, problems with the above characteristics have been handled efficiently by Markov Chain Monte Carlo (MCMC) algorithms for years. MCMC algorithms perform a random walk on the sample space with a locally computable transition probability matrix, in such a manner that the stationary distribution of the random walk aligns with the desired probability distribution. Once the random
walk mixes, any object that the walk visits in the sample space is considered to be a sample taken using the desired probability distribution. For our task, the sample space is the set S, and the desired probability distribution is the iid (uniform) distribution.

Before we discuss the details of the MCMC method for iid sampling of a graphlet embedding, we discuss how, given a uniform sampler, GUISE constructs the graphlet frequency distribution (GFD) effectively. The process is quite simple: GUISE keeps one counter for each of the graphlets, a total of 29 counters, all initialized to 1. Then GUISE calls the sampler repeatedly for a large number of iterations. In each iteration, if the sampled embedding is an embedding of graphlet i, the algorithm increments the counter for i. GUISE constructs the GFD by normalizing the values of the counters and taking the logarithm of those values in a vector in the correct order. The following lemma holds.

Lemma 1: When the size of the sample set C approaches infinity, GUISE returns the correct GFD for a graph.

PROOF: Since each sample returns one of the 29 graphlets using a uniform distribution, the random variable (say, X) that defines the type of graphlet returned in an iteration follows a categorical distribution, with Pr(X = gi) = pi = f(i) / Σ_{j=1}^{29} f(j), where f(i) is the frequency of gi in G. Also note that the i'th entry of the GFD is log pi.

In the sample set C, the expected count for graphlet gi is |C| · pi. Now, if c(i) is the actual count of gi in C, then by the strong law of large numbers, Pr( lim_{|C|→∞} c(i) = |C| · pi ) = 1. So, as |C| approaches infinity, the i'th entry of the GFD computed by GUISE equals log (c(i)/|C|) = log (|C| · pi / |C|) = log pi. Therefore, in the limiting case, GUISE returns the correct GFD for a graph. □
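A minimal sketch of this counter-based construction is given below (not the authors' code). It assumes a helper uniform_graphlet_sampler() that returns the type, an integer from 1 to 29, of one uniformly sampled embedding; building such a sampler is exactly what the rest of this section is about. The base-10 logarithm is assumed, as in the GFD example earlier.

    import math
    import random

    def estimate_gfd(uniform_graphlet_sampler, iterations=100_000):
        """Approximate the GFD from a uniform graphlet-embedding sampler."""
        counters = [1] * 29                      # one counter per graphlet type, initialized to 1
        for _ in range(iterations):
            g = uniform_graphlet_sampler()       # type (1..29) of the sampled embedding
            counters[g - 1] += 1
        total = sum(counters)
        return [math.log10(c / total) for c in counters]   # normalize, then take the logarithm

    # Toy check only: a fake "sampler" that picks a type uniformly at random.
    print([round(x, 2) for x in estimate_gfd(lambda: random.randint(1, 29))][:5])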
B. MCMC Algorithm for Uniform Sampling of a Graphlet

For any MCMC algorithm, we need to define the sample space, the state transition process, the transition probability matrix, and the desired probability distribution. As mentioned earlier, the set of states is the set of embeddings of any of the 3-, 4-, or 5-Graphlets, on which GUISE performs the random walk. Let us call this set S. At any time of the random walk, GUISE visits a specific object in S. It then walks to one of the neighboring states with a probability that is defined by an appropriate state transition probability matrix, T.

1) Neighboring graphlets: For a k-Graphlet, all the graphlets of size k − 1, k, and k + 1 having k − 1, k − 1, and k nodes in common with it, respectively, are its neighboring graphlets. In our case, k + 1 cannot be higher than 5 and k − 1 cannot be lower than 3, which means a 3-Graphlet can have 3-Graphlets or 4-Graphlets as neighbors; a 4-Graphlet can have 3-Graphlets, 4-Graphlets, or 5-Graphlets as neighbors; and a 5-Graphlet can have 4-Graphlets or 5-Graphlets as neighbors. To obtain a same-size neighboring graphlet of a graphlet embedding e, GUISE simply replaces one of the existing vertices of e with another vertex that is not part of e, in a way that ensures the connectedness of the new embedding. Since the new embedding is connected, it embeds one of the size-k graphlets. For obtaining a k-Graphlet from a (k − 1)-Graphlet, GUISE adds one embedding vertex, and for the reverse it deletes one embedding vertex, again ensuring the connectivity of the embedding for both actions. Note that, for a visited embedding, GUISE populates its neighborhood locally, using the adjacency-list information of the constituent vertices of the embedding.

[Figure 4: Neighborhood population of the current graphlet {1, 2, 3, 4}: (a) the example graph; (b) the neighbor boxes labeled 0 through 5]

Example: Suppose GUISE is performing an MCMC walk on the graph shown in Figure 4(a). Let {1, 2, 3, 4} be the currently visited graphlet (a g5 graphlet, shown in bold lines) of size 4. Figure 4(b) shows the information of all its neighbors of size 3, 4, and 5. The box labeled 0 contains the vertices that can be deleted to get all valid neighboring 3-Graphlets. The box labeled 1 contains all the vertices that can be used as a replacement of vertex 1 to get valid neighboring 4-Graphlets; the same is true for the boxes labeled 2, 3, and 4. The box labeled 5 contains all the vertices that can be added to the current graphlet to get all valid neighboring 5-Graphlets. If the random walk of GUISE chooses to go to a neighboring graphlet by simply adding the vertex 8 (a vertex in the box labeled 5), the next sampled graphlet becomes g21. □

2) Transition Probability Matrix: The transition probability matrix T defines the state transition probability between a pair of neighboring graphlets p and q. The transition probability between two graphlets that are not neighbors of each other is zero. For example, in the graph in Figure 4, the transition probability between {1, 2, 3, 4} and {1, 3, 5, 6} is zero, as they are not neighbors of each other according to the neighborhood definition in the previous paragraph.

Note that if the random walk achieves a stationary distribution π, then Equation 1 holds. For our case, the desired stationary distribution is (1/m, 1/m, ..., 1/m), where m = |S|, i.e., a uniform vector of size m. One way to ensure the uniformity of π is to design a symmetric Markov chain, i.e.,
the probability of moving from the state i to the state j and the probability of moving from the state j to the state i are equal. GUISE adopts this strategy, i.e., it uses a symmetric transition probability matrix T, so that T = T^T.

For a graphlet i, its degree d(i) is defined as the total number of its neighbors in the random-walk space. The use of the term degree has an intuitive meaning: if we consider each of the graphlet embeddings as a vertex of a graph, and represent the neighbor relationship between two graphlet embeddings as an edge, then the degree of an embedding is exactly equal to its degree in that graph. In that case, the random walk can be viewed as a walk along the edges of this graph. Now, consider two neighboring graphlets p and q; setting T(p, q) and T(q, p) equal to min(1/d(p), 1/d(q)) makes T a symmetric matrix (note that if p and q are not neighbors, T(p, q) = T(q, p) = 0, which maintains the symmetry). By definition, the entries of each row of T are required to sum to 1; the above probability setting ensures that the sum of the entries of every row is at most 1. In case it is less than 1, we allocate the remaining probability as a self-loop to the resident state. This symmetric transition probability matrix is also doubly stochastic, as the sums of both the rows and the columns of such a matrix equal 1. Now the following lemma (Exercise 6.9 of [18]) is sufficient to prove that the above Markov chain achieves a uniform stationary distribution.

Lemma 2: An ergodic random walk achieves a uniform stationary distribution if and only if its transition probability matrix is doubly stochastic.

PROOF: According to Equation 1 we have

    π = πT    (3)

Here π, a row vector of size m, defines the uniform probability distribution of a random walk with m states, and T is the transition probability matrix. The above equation can be written as:

    (π1, ..., πm) = (π1, ..., πm) ⎛ T(1, 1)  ...  T(1, m) ⎞
                                  ⎜ T(2, 1)  ...  T(2, m) ⎟    (4)
                                  ⎜   ...    ...    ...   ⎟
                                  ⎝ T(m, 1)  ...  T(m, m) ⎠

For any state i,

    πi = Σ_{j=1}^{m} πj · T(j, i)

Since each state has equal probability in the stationary distribution, πi = 1/m. So,

    1/m = (1/m) · Σ_{j=1}^{m} T(j, i)  ⇒  Σ_{j=1}^{m} T(j, i) = 1

which means that the sum of the entries in each column of T is equal to 1, i.e., T is column stochastic. Moreover, T is a transition probability matrix and hence row stochastic; thus, T is doubly stochastic. Since an ergodic random walk has a unique stationary distribution, both the necessary and the sufficient conditions of the lemma are met. □
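The following small NumPy sketch (not from the paper) builds T as described above on a toy neighbor graph, using T(p, q) = min(1/d(p), 1/d(q)) plus a self-loop for the leftover probability mass, and checks that the result is doubly stochastic with a uniform stationary distribution.

    import numpy as np

    # Toy neighbor relation over 4 states (stand-ins for graphlet embeddings).
    neighbors = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
    m = len(neighbors)
    d = {i: len(ns) for i, ns in neighbors.items()}        # degree in the walk space

    T = np.zeros((m, m))
    for p, ns in neighbors.items():
        for q in ns:
            T[p, q] = min(1 / d[p], 1 / d[q])              # symmetric off-diagonal entries
        T[p, p] = 1 - T[p].sum()                           # leftover mass becomes a self-loop

    print(np.allclose(T, T.T))                                            # symmetric
    print(np.allclose(T.sum(axis=0), 1), np.allclose(T.sum(axis=1), 1))   # doubly stochastic
    pi = np.full(m, 1 / m)
    print(np.allclose(pi @ T, pi))                          # the uniform vector is stationary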

Nevertheless, we still need to prove the following:

Lemma 3: The random walk that GUISE uses is ergodic.

PROOF: A Markov chain is ergodic if it converges to a stationary distribution. To have a stationary distribution, the random walk needs to be finite, irreducible and aperiodic [18]. The state space S is finite with size m, because the number of graphlet embeddings is finite. We also assume that the input graph G is connected, so in this random walk any state y is reachable from any state x with a positive probability, and vice versa; hence the random walk is irreducible. Finally, the walk can be made aperiodic by allocating a self-loop probability at every node (this is required only from a theoretical standpoint; in our experiments we do not allocate any self-loop probability unless needed). Thus the lemma is proved. □

3) Sampling Convergence: Convergence is an important issue for any MCMC-sampling-based algorithm. In such sampling, the number of iterations the random process needs to reach its stationary distribution is referred to as the mixing time of the random walk. The smaller the mixing time, the better the sampling quality. The mixing time can be estimated by computing the spectral gap [19] of the transition probability matrix T. It is also related to the diameter of the transition graph. Since most social network graphs have a very small diameter, the diameter of the transition graph is also small, so the mixing time of our random walk is small and the sampling process converges very fast. We skip a detailed discussion of this for the sake of brevity.
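As a small illustration of the spectral-gap estimate (a sketch, not the paper's code), the gap 1 − |λ2| of a transition matrix can be computed directly; a larger gap loosely indicates faster mixing.

    import numpy as np

    def spectral_gap(T):
        """Spectral gap 1 - |lambda_2| of a transition probability matrix T."""
        moduli = np.sort(np.abs(np.linalg.eigvals(T)))[::-1]   # eigenvalue moduli, descending
        return 1.0 - moduli[1]

    # A self-loop-heavy (lazy) chain mixes more slowly than an aggressive one.
    T_slow = np.array([[0.9, 0.1], [0.1, 0.9]])
    T_fast = np.array([[0.5, 0.5], [0.5, 0.5]])
    print(spectral_gap(T_slow), spectral_gap(T_fast))          # about 0.2 vs 1.0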
C. Pseudo-Code

The pseudo-code of GUISE is given in Figure 5. It takes two parameters, an input graph G and the total number of samples (SCount). GUISE starts by picking any initial graphlet (gx) in Line 1. In Line 2 it populates the neighborhood of gx according to the technique discussed in subsection IV-A and saves the neighbor list in a graphlet data structure (dgx). Then, in an iterative way, it chooses a graphlet gy from gx's neighbors with uniform probability and finds its neighbors (Lines 5 and 6). After computing the acceptance probability in Line 7, if the move is accepted, GUISE replaces the current graphlet gx by gy (Lines 9 and 10); otherwise gx is kept unchanged. It also increments the number of embeddings sampled (sampled) and the visit count of the current graphlet type by one (Lines 11 and 12). The process is repeated for at least SCount iterations.

Line 5 chooses a neighbor gy of gx with probability 1/|dgx|, and Line 8 accepts that choice with probability min(|dgx|/|dgy|, 1). So, the transition probability is T(gx, gy) = min(1/|dgx|, 1/|dgy|).

GUISE (G, SCount):
 1.  gx = get_initial_graphlet(G)
 2.  dgx = populate_neighborhood(gx)
 3.  sampled = 0
 4.  while (true)
 5.      choose a neighbor gy uniformly from all possible neighbors
 6.      dgy = populate_neighborhood(gy)
 7.      acceptance_probability = min(|dgx| / |dgy|, 1)
 8.      if uniform(0, 1) ≤ acceptance_probability
 9.          gx = gy
10.          dgx = dgy
11.      sampled += 1
12.      get_graphlet_type(gx) += 1
13.      if (sampled > SCount)
14.          return

populate_neighborhood(gx):
 1.  neighbor_list = generate all potential neighboring graphlets
 2.  return neighbor_list

Figure 5: Uniform Sampling Algorithm
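A compact Python rendering of Figure 5 is sketched below (not the authors' code). The helpers populate_neighborhood and get_graphlet_type are assumed to behave as described in Sections IV-A and IV-B, and initial_embedding is any starting graphlet embedding.

    import random

    def guise(G, scount, populate_neighborhood, get_graphlet_type, initial_embedding):
        """MCMC walk of Figure 5; returns visit counts per graphlet type (1..29)."""
        counters = {t: 1 for t in range(1, 30)}        # one counter per graphlet type
        gx = initial_embedding
        dgx = populate_neighborhood(G, gx)
        sampled = 0
        while True:
            gy = random.choice(dgx)                    # propose a neighbor with probability 1/|dgx|
            dgy = populate_neighborhood(G, gy)
            if random.random() <= min(len(dgx) / len(dgy), 1):   # accept with min(|dgx|/|dgy|, 1)
                gx, dgx = gy, dgy
            sampled += 1
            counters[get_graphlet_type(gx)] += 1       # count the current (possibly unchanged) type
            if sampled > scount:
                return counters

The returned counters can then be turned into a GFD with the normalize-and-log step sketched after Lemma 1.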
IV. IMPLEMENTATION DETAILS

GUISE accepts a connected graph G for which it computes the GFD. It starts from a random graphlet embedding (say gt); this can simply be an embedding of a g1 graphlet, which is easy to obtain. Then it computes the transition probability matrix T locally, which requires knowledge of the degree of gt; it is also important to know the graphlet type of gt, so that the correct counter can be incremented.

A. Populating the neighborhood of a graphlet

Populating the neighborhood of a graphlet is the most time-consuming task. In the following we explain how GUISE populates the neighborhood of a 4-Graphlet gt. To obtain a 3-Graphlet, GUISE first deletes one of the vertices of gt and checks whether the remaining 3 vertices are still connected in the input graph. If yes, a 3-Graphlet neighbor of gt is obtained. Note that gt can have at most 4 such neighboring graphlets. To obtain all neighboring 4-Graphlets of gt, GUISE again removes one of the vertices of gt and checks whether the remaining 3 vertices are still connected. If this succeeds, it computes the union of the adjacency lists of these 3 vertices. Each vertex of the resulting set, together with the 3 undeleted vertices (of gt), represents a neighbor of gt. The process of removing and combining is repeated for all the vertices of gt. Finally, to get all neighboring 5-Graphlets of gt, GUISE takes the union of the adjacency lists of all 4 of its vertices, picks a vertex from the union set, and combines it with gt.

Following the above techniques, GUISE can populate the neighborhood of size-3, 4 and 5 graphlets.
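A minimal sketch of this neighborhood population for a 4-node embedding is shown below (illustrative only, not the authors' implementation). It assumes G is a dict mapping each vertex to a collection of its neighbors; a production implementation would instead exploit sorted adjacency lists, as discussed in the complexity analysis.

    def connected(G, vertices):
        """Check whether `vertices` induce a connected sub-graph of G (simple DFS)."""
        vs = set(vertices)
        start = next(iter(vs))
        seen, stack = {start}, [start]
        while stack:
            for v in G[stack.pop()]:
                if v in vs and v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen == vs

    def populate_neighborhood_4(G, emb):
        """All 3-, 4-, and 5-node neighboring embeddings of the 4-node embedding `emb`."""
        emb = tuple(emb)
        neighbors = []
        for u in emb:                                        # delete one vertex: 3-Graphlets
            rest = [v for v in emb if v != u]
            if connected(G, rest):
                neighbors.append(tuple(rest))
                # replace the deleted vertex: same-size (4-node) neighbors
                candidates = set().union(*(G[v] for v in rest)) - set(emb)
                neighbors += [tuple(rest) + (w,) for w in candidates]
        # add one vertex adjacent to the embedding: 5-Graphlets
        candidates = set().union(*(G[v] for v in emb)) - set(emb)
        neighbors += [emb + (w,) for w in candidates]
        return neighbors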
B. Identifying graphlet type

To identify the type of a graphlet gt, GUISE first treats gt as a graph g(v, e), where the cardinality of v is between 3 and 5 and the cardinality of e is between 2 and 10. First of all, graphlets can be categorized based on the cardinality of v. To distinguish the graphlets within each category, we introduce a signature for each graphlet based on the degree of each vertex in g(v, e); we denote this signature as the degree-signature. We first compute the degree of each vertex of the graph g(v, e) and save it in a vector of size |v|. Then we sort the vector and use this sorted vector as the signature of the graphlet. GUISE finds this signature and, based on it, identifies which type of graphlet gt is. It is possible that two distinct types of graphlets have the same degree-signature; in that case GUISE checks additional criteria to make them distinguishable. Please note that this scenario occurs only for two pairs of graphlets, (g13, g16) and (g20, g21).

Example: Let us take three graphlets, g12, g22 and g26. These are all size-5 graphlets. For g12 we can easily compute the degrees of all vertices, save them in a vector, and after sorting the vector we get the degree-signature of g12, which is (1, 1, 2, 3, 3). In a similar fashion we can get the degree-signatures of g22 and g26, which are (2, 2, 2, 4, 4) and (2, 3, 3, 4, 4), respectively. As we can see, the degree-signatures are unique for all three graphlets. A similar trend holds for the other graphlets, except for the two pairs of graphlets mentioned above. For g13 and g16, both have (1, 2, 2, 2, 3) as their degree-signature. We can easily make them distinguishable if we look one level deeper into their structure: for g13 the minimum-degree vertex has only a neighboring vertex of degree 2, whereas for g16 it has a neighboring vertex of degree 3. Using a similar trick, we can make g20 and g21 distinguishable.
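A short sketch of the degree-signature computation (not the authors' code) is given below. The embedding is assumed to be given as a dict of adjacency sets restricted to its own vertices, and the tie-breaker follows the neighbor-degree check described above.

    def degree_signature(sub):
        """Sorted degree sequence of an induced sub-graph given as {vertex: set_of_neighbors}."""
        return tuple(sorted(len(ns) for ns in sub.values()))

    def min_degree_neighbor_degrees(sub):
        """Degrees of the neighbors of a minimum-degree vertex (tie-breaker for equal signatures)."""
        u = min(sub, key=lambda v: len(sub[v]))
        return sorted(len(sub[w]) for w in sub[u])

    # A 5-node sub-graph whose signature matches the g12 example: a triangle {1, 2, 3}
    # with pendant vertices 4 and 5 attached to two different corners.
    sub = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 5}, 4: {2}, 5: {3}}
    print(degree_signature(sub))      # (1, 1, 2, 3, 3)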
C. Complexity analysis

The most expensive part of GUISE is the neighborhood computation. For neighborhood computation we need to perform union operations on the adjacency lists of the current graphlet. Our assumption that the adjacency lists are stored in sorted order allows us to perform a union operation in time linear in the lengths of the participating sets (adjacency lists). For example, the cost of performing a union over two adjacency lists of sizes m and n is O(m + n).
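A minimal sketch of the linear-time union of two sorted adjacency lists (a standard two-pointer merge, not the paper's code):

    def sorted_union(a, b):
        """Union of two sorted adjacency lists in O(len(a) + len(b)) time."""
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1       # common vertex, emit once
            elif a[i] < b[j]:
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        out.extend(a[i:]); out.extend(b[j:])           # drain whichever list remains
        return out

    print(sorted_union([1, 3, 5, 8], [2, 3, 8, 9]))    # [1, 2, 3, 5, 8, 9]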
The worst-case time for finding the neighbors of a 3-Graphlet is O(9p) (3 · O(2p) + O(3p)), where p is the average length of the adjacency lists. Similarly, for a 4-Graphlet and a 5-Graphlet the time complexities are O(24p) (4 · O(2p) + 4 · O(3p) + 2 · O(2p)) and O(30p) (5 · O(2p) + 5 · O(4p)), respectively. The total execution time over all iterations is O(9p) · |3-Graphlets| + O(24p) · |4-Graphlets| + O(30p) · |5-Graphlets|, where |k-Graphlets| denotes the number of k-Graphlet embeddings sampled.
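As a purely hypothetical worked example (the parameters below are not from the paper), with an average adjacency-list length of p = 50 and 10,000 sampled embeddings of each size, the bound above amounts to a few tens of millions of list operations:

    p = 50                           # hypothetical average adjacency-list length
    n3 = n4 = n5 = 10_000            # hypothetical numbers of sampled 3-, 4-, 5-node embeddings
    total_ops = 9 * p * n3 + 24 * p * n4 + 30 * p * n5
    print(total_ops)                 # 31500000, i.e. about 3.15e7 list operations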
