Complex Network Report

the result of the term project to compare different motif detection software available in the market

Uploaded by

Mohit Sharma

IIT KHARAGPUR

Complex Network
Comparing Motif Detection Algorithms/tools

Mohit Sharma 10CS30028
Imran 10CS30021
Ashwahni Attri 10CS30010

4/24/2013
Under the Guidance of:
Rishiraj Saha Roy
Prof. Animesh Mukherjee

Introduction
All networks, including biological networks (e.g., metabolic networks, transcription regulatory networks, protein-protein
interaction networks, protein structure networks, neural networks, ecological networks), social networks, technological
networks (e.g., computer networks, electrical circuits), etc., can be represented as graphs, which include a wide variety of
subgraphs. One important local property of networks is the presence of so-called network motifs, which are defined as
recurrent and statistically significant sub-graphs or patterns. Motifs, sub-graphs that repeat themselves in a specific network
or even among various networks, would be consistent with the tenets of evolutionary theory. Each of these sub-graphs,
defined by a particular pattern of interactions between vertices, may reflect a framework in which particular functions are
achieved efficiently. Indeed, motifs are of notable importance largely because they may reflect functional properties. They
have recently gathered much attention as a useful concept to uncover structural design principles of complex
networks.[1] Although network motifs may provide deep insight into a network's functional abilities, their detection is
computationally challenging.

Objective
In this term project, we explore several state-of-the-art motif finding algorithms and compare them on aspects like
scalability, ease of integration with existing code modules, and apparent drawbacks. A comparative interpretation of the
results can be insightful for understanding how these algorithms work. Apart from this, we also try to gain some
insight into the semantic properties of the network motifs in the query log.

Algorithms and Tools

FANMOD
Kavosh
Biemann
MAVisto
Kashtan(mfinder)

Motifs Considered
Size 3 Motifs:

3-Clique
3-semi-clique (3-chain)

Size 4 Motifs:

4-Chain
4-Box
4-semi-clique
4-loop-out
4-clique
4-star
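
The motif classes listed above can be told apart by degree sequence alone. The following is a minimal brute-force sketch, assuming that motifs are counted as connected induced subgraphs of an undirected graph. The names `MOTIF_BY_DEGREES` and `count_motifs` are illustrative and are not taken from any of the tools compared here. It is only practical for very small graphs.

```python
from itertools import combinations

# Sorted degree sequence of each connected induced subgraph on 3 or 4 nodes,
# keyed to the motif names used in this report.
MOTIF_BY_DEGREES = {
    (1, 1, 2): "3-chain",
    (2, 2, 2): "3-clique",
    (1, 1, 2, 2): "4-chain",
    (1, 1, 1, 3): "4-star",
    (2, 2, 2, 2): "4-box",
    (1, 2, 2, 3): "4-loop-out",
    (2, 2, 3, 3): "4-semi-clique",
    (3, 3, 3, 3): "4-clique",
}


def count_motifs(edges, size):
    """Count connected induced subgraphs of the given size (3 or 4) by class."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    counts = {}
    for nodes in combinations(sorted(adj), size):
        node_set = set(nodes)
        # Degree of each node inside the induced subgraph.
        degrees = tuple(sorted(len(adj[n] & node_set) for n in nodes))
        name = MOTIF_BY_DEGREES.get(degrees)  # None for disconnected subgraphs
        if name is not None:
            counts[name] = counts.get(name, 0) + 1
    return counts


# A triangle A-B-C with one extra node D attached to A.
example = [("A", "B"), ("B", "C"), ("A", "C"), ("A", "D")]
print(count_motifs(example, 3))
print(count_motifs(example, 4))
```

Enumerating all node subsets costs on the order of n^4 for size-4 motifs, which is why dedicated tools are needed for networks of the sizes studied below.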

QLN: Definition and Construction

A Query Log Network (QLN) for any given Query Log is defined as a network (N, E), where N is the set of nodes, each
labeled by a unique word, and E is the set of edges. Two nodes i, j ∈ N are connected by an edge (i, j) ∈ E if and only if
i and j co-occur in a sentence (here, a query). Co-occurrence can be defined in many ways; here we have used local
co-occurrence:
Local co-occurrence: According to this model of QLN, the immediate word neighborhood is considered important and an
edge is added between two words if they occur within a distance of two (i.e. separated by zero or one word) in a query.
For the QLN, edges resulting from random collocations are pruned using edge restriction as follows. Let i and j be two
distinct words from the log. Let pi, pj and pij be the probabilities of occurrence of i, j and the pair (i, j) in the log,
respectively. Then, in a restricted network, an edge exists if and only if pij > pi*pj. All networks considered in this study
are undirected and unweighted.
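
The construction above can be sketched as follows. This is a minimal illustration, not the code used for this project. It assumes that pi is the fraction of queries containing word i and that pij is the fraction of queries in which i and j co-occur within the window. The report does not state how the probabilities were estimated, and the function name `build_qln` is hypothetical.

```python
from collections import Counter


def build_qln(queries, window=2):
    """Return the restricted, undirected QLN as a set of sorted word pairs."""
    word_count = Counter()  # number of queries containing each word
    pair_count = Counter()  # number of queries where a pair co-occurs locally
    for query in queries:
        words = query.lower().split()
        word_count.update(set(words))
        pairs = set()
        for i, w in enumerate(words):
            # Words at distance 1..window, i.e. separated by zero or one word.
            for j in range(i + 1, min(i + window + 1, len(words))):
                if words[j] != w:
                    pairs.add(tuple(sorted((w, words[j]))))
        pair_count.update(pairs)
    total = len(queries)
    edges = set()
    for (a, b), c in pair_count.items():
        # pij > pi * pj, written with integer counts to avoid rounding:
        # c/total > (count_a/total) * (count_b/total)
        if c * total > word_count[a] * word_count[b]:
            edges.add((a, b))
    return edges


log = ["free music download", "free music", "cheap flights", "download music"]
print(sorted(build_qln(log)))
```

In this toy log, the pair (download, free) co-occurs exactly as often as chance would predict under the assumed estimates, so the strict inequality prunes it.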
Normalization
Since motif counts are dependent on the size of a network, they must be suitably normalized. We normalize a motif count
by the expected count of that motif for an Erdos-Renyi (ER) random graph model [20] with the same number of nodes and
edge density [60]. Since the ratios of the probabilities can be very skewed, we take the natural logarithm of this quantity,
which we shall refer to as the Log Normalized Motif Count or LNMC. Thus,

$$\mathrm{LNMC}(M_i^n) = \log_e \frac{\text{actual count of } M_i^n}{\text{expected count of } M_i^n \text{ in the ER graph}}$$

where $M_i^n$ is the ith n-size motif.
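
One way to obtain the expected count is the standard closed form for induced subgraphs in an ER graph G(n, p). The sketch below assumes that form. The report does not give its exact normalization constants, so this code is not guaranteed to reproduce the values in the tables that follow. The names `MOTIF_SHAPE`, `expected_er_count` and `lnmc` are illustrative.

```python
from math import comb, factorial, log

# (nodes, edges, automorphisms) of each motif class.
MOTIF_SHAPE = {
    "3-chain": (3, 2, 2),
    "3-clique": (3, 3, 6),
    "4-chain": (4, 3, 2),
    "4-star": (4, 3, 6),
    "4-box": (4, 4, 8),
    "4-loop-out": (4, 4, 2),
    "4-semi-clique": (4, 5, 4),
    "4-clique": (4, 6, 24),
}


def expected_er_count(motif, n_nodes, n_edges):
    """Expected number of induced copies of a motif in G(n, p)."""
    k, e, aut = MOTIF_SHAPE[motif]
    p = n_edges / comb(n_nodes, 2)  # edge density of the observed network
    # Ways to place the motif: choose k nodes, then a distinct labeling.
    placements = comb(n_nodes, k) * (factorial(k) // aut)
    # Motif edges must be present, all other node pairs must be absent.
    return placements * p**e * (1 - p) ** (comb(k, 2) - e)


def lnmc(motif, observed, n_nodes, n_edges):
    """Natural log of observed count over expected ER count."""
    return log(observed / expected_er_count(motif, n_nodes, n_edges))


# With 4 nodes and 3 edges (p = 0.5), 1.5 induced 3-chains are expected.
print(expected_er_count("3-chain", 4, 3))
```

An LNMC of 0 means the motif occurs exactly as often as in the random baseline. Positive values indicate over-representation.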

Results
We present here the raw counts of the motifs and the LNMC for networks of various sizes. We also include the running time
of the 5 tools/algorithms used.

Query Log 1 (size:2029 queries)

Raw Counts:

Tool      3-chain       3-clique      4-chain       4-star        4-loopout     4-box         4-semiclique  4-clique
Biemann   39397         1882          283595        732187        76309         1504          6187          512
Mfinder   39397         1882          283595        732187        76309         1504          6187          512
Kavosh    95.44%        4.56%         25.77%        66.54%        6.94%         0.08%         0.539219%     0.069618%
FANMOD    39397         1882          283595        732187        76309         1504          6187          512
MAVisto   45043         1882          485495        822918        107201        9227          9259          512

Table 1
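
The Kavosh row in Table 1 appears to report each motif's share among all motifs of the same size rather than a raw count. The short sketch below checks that reading against the size-3 counts in the FANMOD row. The function name `concentrations` is illustrative.

```python
def concentrations(counts):
    """Percentage share of each motif among all counted motifs of one size."""
    total = sum(counts.values())
    return {name: 100.0 * c / total for name, c in counts.items()}


size3 = {"3-chain": 39397, "3-clique": 1882}  # Table 1, FANMOD row
shares = concentrations(size3)
print({name: round(value, 2) for name, value in shares.items()})
# {'3-chain': 95.44, '3-clique': 4.56}
```

The two shares agree with the Kavosh entries for 3-chain and 3-clique.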

Running Time:

Tool      Size 3            Size 4
Biemann   _______
Mfinder   5.67 min          1.08 hrs
Kavosh    9.09 sec          90.67 sec
FANMOD    _______
MAVisto   12 min 40 sec     2 hr 12 sec

Table 2

Normalized Count:

Tool      3-chain       3-clique      4-star        4-loopout     4-chain       4-box         4-semiclique  4-clique
Biemann   15.918035     6.461399      18.180713     15.919468     11.627872     11.992805     19.640088     24.256635
Mfinder   15.918035     6.461399      18.180713     15.919468     11.627872     11.992805     19.640088     24.256635
Kavosh    15.918027     6.46157       18.181255     15.920754     11.628311     11.457723     19.598743     24.659851
FANMOD    15.918035     6.461399      18.180713     15.919468     11.627872     11.992805     19.640088     24.256635
MAVisto   15.923766     6.333202      18.032217     15.994066     11.900178     13.541494     19.777916     23.991318

Table 3

Query Log 2 (size: 1000 queries)

Raw Count:

Tool      3-chain       3-clique      4-star        4-loopout     4-chain       4-box         4-semiclique  4-clique
Biemann   52097         979           2934041       73010         226949        675           3107
Mfinder   52097         979           2934041       73010         226949        675           3107
Kavosh    98.15547%     1.844525%     90.61647%     2.02154%      7.00921%      0.01655%      0.087897%     0.010686%
FANMOD    52097         979           2934041       73010         226949        675           3107

Table 4

Running Time:

Tool      Size 3            Size 4
Biemann   ________          _______
Mfinder   3 min 9 sec       1 hr 2 min
Kavosh    2.48 sec          3 min 56.4 sec
FANMOD    _______

Table 5

Normalized Count:

Tool      3-chain       3-clique      4-star        4-loopout     4-chain       4-box         4-semiclique  4-clique
Biemann   9.035738      11.227082     12.144607     13.805812     14.939941     9.122173      16.632226     19.892286
Mfinder   9.035738      11.227082     12.144607     13.805812     14.939941     9.122173      16.632226     19.892286
Kavosh    9.035738      11.227082     12.145177     13.782722     14.940511     8.892158      16.545049     21.296640
FANMOD    9.035738      11.227082     12.144607     13.805812     14.939941     9.122173      16.632226     19.892286

Table 6

Query Log 3 (size: 10,000 queries)

Raw Count:

Tool      3-chain       3-clique      4-star        4-loopout     4-chain       4-box         4-semiclique  4-clique
Biemann   1837412       10411         565434017     3930501       12192850      29157         51143         1436
Mfinder   1837412       10411         565434017     3930501       12192850      29157         51143         1436
Kavosh    97.43658%     0.563420%     96.27968%     0.635352%     2.096291%     0.03304%      0.4264%       0.001044%
FANMOD    1837412       10411         565434017     3930501       12192850      29157         51143         1436

Table 7

Running Time:

Tool      Size 3            Size 4
Biemann   88 sec            733 sec
Mfinder   3 hrs 12 min      1.54 days
Kavosh    1 min 19.18 sec   12.804947222 hrs
FANMOD    1 sec             601 sec

Table 8

Normalized Counts:

Tool      3-chain       3-clique      4-star        4-loopout     4-chain       4-box         4-clique      4-semiclique
Biemann   13.045956     16.045157     18.202608     20.595303     21.727386     15.691476     24.243536     29.536371
Mfinder   13.045956     16.045157     18.202608     20.595303     21.727386     15.691476     24.243536     29.536371
Kavosh    9.048590      10.061333     18.202800     21.73723      20.54348      15.302440     23.5296       30.98809
FANMOD    13.045956     16.045157     18.202608     20.595303     21.727386     15.691476     24.243536     29.536371

Table 9

Query Log 4 (size: 1,00,000 queries)

Raw Count:

Tool      3-chain       3-clique      4-star        4-loopout     4-chain       4-box         4-semiclique  4-clique
Biemann   24360076      145737        8194654585    106982250     934755823     3535399       1832721       39777
Mfinder   24360076      145737        8194654585    106982250     934755823     3535399       1832721       39777
Kavosh
FANMOD    24360076      145737        8194654585    106982250     934755823     3535399       1832721       39777

Table 10

Running Time:

Tool      Size 3            Size 4
Biemann   1 min 7 sec       5 hrs 26 min
Mfinder   1.132 days        12.474 days
Kavosh    16.603 min        7.359 days
FANMOD    26 sec            3.694 hrs

Table 11

Normalized Count:

Tool      3-chain       3-clique      4-star        4-loopout     4-chain       4-box         4-semiclique  4-clique
Biemann   15.111847     19.199322     21.209101     25.265969     27.433591     21.856132     30.223162     36.292410
Mfinder   15.111847     19.199322     21.209101     25.265969     27.433591     21.856132     30.223162     36.292410
Kavosh
FANMOD    15.111847     19.199322     21.209101     25.265969     27.433591     21.856132     30.223162     36.292410

Table 12

Analysis
We expected the results of the algorithms/tools used in this project to differ on the networks built from the larger query
logs. However, the counts reported by Biemann, mfinder and FANMOD matched exactly on every network, so we may need
still larger networks to observe a difference. Here is a summary of our conclusions about the algorithms/tools used:

o Kavosh ran on the network generated from the Query Log of size 1,00,000 but gave meaningless results (percentages
greater than 100).
o FANMOD, Biemann and Kashtan (mfinder) are scalable at least up to the network generated from a Query Log of size
1,00,000.
o The motif counts given by these algorithms are the same for networks generated from Query Logs of size up to 1,00,000.

Now we provide a summary of the comparison of all the algorithms/tools based on the points of comparison specified in the
objective:

Accuracy
o For the network sizes we considered, all gave the same results.

Runtime
o FANMOD runs the fastest among all others.
o Biemann also has a runtime close to FANMOD.
o Kashtan (mfinder) was the slowest.

Scalability
o FANMOD, Biemann and Kashtan (mfinder) are scalable at least up to a (30853, 143306) size network.
o Kavosh is scalable at least up to a (6331, 16957) size network.

Ease of integration with existing code

From our experience of using the implementations:

o FANMOD:
Implemented in popular graph libraries like igraph and NetworkX.
Can be used in various languages like C, Python, R.
o Kashtan (mfinder):
Implemented in C.
Easy to use.
o Kavosh:
Implemented in C.
Easy to use.
o Biemann:
Implemented in Java.
Code has to be modified to find motifs belonging to different classes.
o MAVisto:
Available as a Java Web Start application.
Difficult to use in some environments.

Anomalies in Motif Count

The relative counts of motifs of different classes computed for the QLN were counter-intuitive. For example, one would
expect each 4-chain to contribute two 3-semi-cliques (3-chains). But the results showed that the number of 4-chains was far
larger than the number of 3-semi-cliques. Here we identify various such cases and provide examples to explain the results
obtained. All the counts included in the explanation are from Table 7 (Query Log sample size of 10,000).
1. #3-semi-clique > #4-chain:
Since a single 4-chain motif contains two 3-semi-cliques, one would expect the count of 3-chains to be more than
double the count of 4-chains.

This 4-chain contains two 3-semi-cliques (1 and 2).

But the results showed a major anomaly. From the QLN, we obtained 1,21,92,850 4-chains but only 18,37,412
3-semi-cliques (3-chains). Here is an example which shows that these numbers can be justified.

# 4-chain : 16
# 3-semi-clique : 8
This kind of structure is expected to be abundant in a real network where community structure is prevalent.
2. #3-clique > #4-loop-out:
Since a single 4-loop-out motif contains one 3-clique (triangle), one would expect the count of 3-cliques (triangles)
to be more than the count of 4-loop-outs.

This 4-loop-out contains one 3-clique (A-B-D).

But for the QLN we obtained 39,30,501 4-loop-outs but only 10,411 3-cliques (triangles). Here is an example which
shows that these numbers can be justified.

# 4-loop-out : 9
# 3-clique : 1
This kind of structure would also be abundant in a real network where community structure is prevalent.
3. #3-chain > #4-star:
Since a single 4-star motif contains three 3-chains, one would expect the count of 3-chains to be at least thrice the
count of 4-stars.

This 4-star contains three 3-semi-cliques (C-A-B, C-A-D, D-A-B).

But for the QLN, we obtained 56,54,34,017 4-star motifs but only 18,37,412 3-semi-clique motifs. Here is an example
which shows that these numbers can be justified.

# 4-star : 56
# 3-semi-clique : 28
This kind of structure will be prevalent in a real network where some nodes have very high degree. In our QLN, the
nodes corresponding to conjunctions will play this role.

4. #3-clique > #4-semi-clique:

Since a single 4-semi-clique motif contains two 3-cliques (triangles), one would expect the count of 3-cliques to be
more than twice the count of 4-semi-cliques.

This 4-semi-clique contains two 3-cliques (A-B-C, A-D-C).

But for the QLN, we obtained 51,143 4-semi-clique motifs but only 10,411 3-cliques (triangles). Here is an
example which shows that these numbers can be justified.

# 4-semi-clique : 9
# 3-clique : 6
This kind of structure is going to arise in a real network where some nodes have very high degree. In our QLN, the
nodes corresponding to conjunctions will play this role.
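
The example counts in cases 2 and 3 can be reproduced with simple combinatorics. The sketch below uses two small graphs that yield exactly those numbers: a star with 8 leaves, and a triangle with 3 pendant leaves on each corner. These graphs are assumptions chosen to match the stated counts and are not necessarily the ones drawn in the original figures. Both function names are illustrative.

```python
from math import comb


def star_counts(leaves):
    """Induced 3-chains and 4-stars in a star graph with the given leaf count."""
    # The hub plus any 2 leaves is a 3-chain; the hub plus any 3 is a 4-star.
    return {"3-chain": comb(leaves, 2), "4-star": comb(leaves, 3)}


def triangle_with_pendants(pendants_per_corner):
    """3-cliques and 4-loop-outs in a triangle whose corners carry pendant leaves."""
    # Each pendant leaf together with the triangle forms exactly one loop-out.
    return {"3-clique": 1, "4-loop-out": 3 * pendants_per_corner}


print(star_counts(8))             # {'3-chain': 28, '4-star': 56}
print(triangle_with_pendants(3))  # {'3-clique': 1, '4-loop-out': 9}
```

Because the number of 4-stars at a hub grows as the cube of its degree while the number of 3-chains grows only as the square, a few very high-degree nodes are enough to make 4-stars outnumber 3-chains.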

Analysis of Motif Structure in Query Log Network

Motifs in a network are the basic functional units of the network. We tried to verify this hypothesis by finding the words
corresponding to nodes in motifs of different classes. The results were very interesting, as words occurring in motifs of a
particular class were related to each other in a specific way. Here we provide some examples we obtained and try to derive
some conclusions about the semantic relatedness of words that occur in the motifs.
3-Semiclique

nfl scores current

upgrade computer worm

download free instructions

beach wedding calculation

As the motif structure suggests, the middle node connects two words which one would not expect to come together in a
query. For example, in the motif upgrade-computer-worm, computer connects the two words upgrade and worm, which
normally would not come together in a query.
3-Clique :

cheapest internet service

mayo clinic ocd

air travel information

forearm muscle pain

As the network structure of a 3-clique suggests, it represents a set of three words which have a high probability of coming
together in a query. For example, in the motif mayo-clinic-ocd, all words have a strong relation to each other: Mayo is a
clinic in the USA with facilities for the treatment of OCD (Obsessive-Compulsive Disorder).
4-Chain :

space satellite states united

patriot point south Korean

Semantic polysemy refers to the phenomenon that a word, denoted as a string of characters, can have different denotations in
different contexts. In co-occurrence networks, polysemy leads to chains: ambiguous words connect words that are not
connected to each other, and act as a bridge between different topical word clusters. In a chain of length four, one more
word from a topical cluster is observed, which does not connect to the polysemous word since their occurrences are deemed
rather independent by the significance measure. For example, in the motif patriot-point-south-Korean, south is used in two
senses: with point it signifies a direction, while South Korean is a separate term in itself.
4-Box Motifs

machines elliptical machine eliptical

brother hospital brothers center

annuity index annuities indexed

garden raise gardens bed

Synonymy means that different words refer to the same concept. Two words are synonyms if they can be used
interchangeably without changing the meaning, but there are also rather syntactic variants of words that refer to the same
concept, such as nominalizations of adjectives or verb forms of different inflections. For example, in the motif
annuity-index-annuities-indexed, annuity and annuities can be used interchangeably depending upon the usage, and the
same could be said about index and indexed.
4-Star

Cheap access isp car

Texas border houston boarder

Espn basketball athletics sports

Poker online live results

A query could be logically divided into two parts, content and intent. Content represents the theme of the query, and the
intent part is added to specify the information required about the content part. Initially we thought a 4-star represents a
content word surrounded by three intent words, as in the case of the motif Poker-online-live-results. Here Poker is the
content part and online, live and results are the intent part. But what we observed is that an intent word could also be the
center of the 4-star, surrounded by three content words, as in the case of cheap-access-isp-car. Here, cheap is the intent part
and access, isp and car are the content parts.
4- Loop Out :
score espn football sat

cheap car parts access

video games playstation camera

service provider internet phone

A 4-loop-out represents a set of words in which the first three are connected to each other and the last one is connected to
one of the first three. In a QLN, this means that the first three words have a high probability of occurring together in a
query and the last word will come with the word it is connected to. For example, in the motif score-espn-football-sat, score,
espn and football are likely to occur together in a query, while sat will occur only with score and not with any other word.
4-SemiClique

keyboard mouse reviews customer

store locations discovery kids

wedding outdoor budget calculation

works Michigan job listing

A 4-semi-clique represents a structure where all except one pair of nodes are connected. In the above examples, the only
pair of unrelated words is the second and the last word, that is (mouse, customer), (locations, kids), (outdoor, calculation)
and (Michigan, listing).
4-Clique

university houston texas rice

computer upgrade ram memory

america president united of

railroad southern city Kansas

A 4-clique represents a group of words which are very likely to come together in a single query. The examples we obtained
clearly support this point.

References
1. Chris Biemann, Stefanie Roos, Karsten Weihe, Quantifying Semantics using Complex Network
Analysis.
2. Sebastian Wernicke, A Faster Algorithm for Detecting Network Motifs.
3. Rishiraj Saha Roy and Niloy Ganguly, Smith Agarwal, Monojit Choudhury, Structural Complexity
of Web Search Queries through the Lenses of Language Models, Networks and Users.
4. Falk Schreiber and Henning Schwobbermeyer, MAVisto: A tool for the exploration of network
motifs.
5. N. Kashtan, S. Itzkovitz, R. Milo and U. Alon, Efficient sampling algorithm for estimating
subgraph concentrations and detecting network motifs.
6. Zahra Razaghi Moghadam Kashani, Hayedeh Ahrabian, Elahe Elahi, Abbas Nowzari-Dalini, Elnaz
Saberi Ansari, Sahar Asadi, Shahin Mohammadi, Falk Schreiber and Ali Masoudi-Nejad, Kavosh: a
new algorithm for finding network motifs.
7. MAVisto -- http://mavisto.ipk-gatersleben.de/
8. www.wikipedia.org