Extraction and Classification of Dense Communities in the Web
Yon Dourisboure
Filippo Geraci
Marco Pellegrini
Istituto di Informatica e Telematica - CNR, Via Moruzzi 1, Pisa, Italy
[email protected]  [email protected]

ABSTRACT
The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information and services, and there is a growing interest in tools for understanding collective behaviors and emerging phenomena in the WWW. In this paper we focus on the problem of searching and classifying communities in the web. Loosely speaking, a community is a group of pages related to a common interest. More formally, communities have been associated in the computer science literature with the existence of a locally dense subgraph of the web-graph (where web pages are nodes and hyper-links are arcs of the web-graph). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs. We apply our algorithm to web-graphs built on three publicly available large crawls of the web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the web-graph and counting how many of these are blindly found. Effectiveness increases with the size and density of the communities: it is close to 100% for communities of thirty nodes or more (even at low density), and still about 80% for communities of twenty nodes with over 50% of the arcs present. At the lower extreme, the algorithm catches 35% of dense communities made of ten nodes. We complete our Community Watch system by clustering the communities found in the web-graph into homogeneous groups by topic and labelling each group by representative keywords.
1. INTRODUCTION
Why are cyber-communities important? Searching for social structures in the World Wide Web has emerged as one of the foremost research problems related to the breathtaking expansion of the World Wide Web. Since the pioneering work of Gibson, Kleinberg and Raghavan [19], there has been keen academic as well as industrial interest in developing efficient algorithms for collecting, storing and analyzing the pattern of pages and hyper-links that form the World Wide Web. Nowadays many communities of the real world that want to have a major impact and recognition are represented in the Web. Thus the detection of cyber-communities, i.e. sets of sites and pages sharing a common interest, also improves our knowledge of the world in general.

Cyber-communities as dense subgraphs of the web graph. The most popular way of defining cyber-communities is based on the interpretation of WWW hyper-links as social links [10]. For example, the web page of a conference contains a hyper-link to each of its sponsors; similarly, the home page of a car lover contains links to all famous car manufacturers. In this way, the Web is modelled by the web graph, a directed graph in which each vertex represents a web page and each arc represents a hyper-link between the two corresponding pages. Intuitively, cyber-communities correspond to dense subgraphs of the web graph.

An open problem. Monika Henzinger, in a recent survey on algorithmic challenges in web search engines [26], remarks that the Trawling algorithm of Kumar et al. [31] is able to enumerate dense bipartite graphs in the order of tens of nodes, and states this open problem: "In order to more completely capture these cyber-communities, it would be interesting to detect much larger bipartite subgraphs, in the order of hundreds or thousands of nodes. They do not need to be complete, but should be dense, i.e. they should contain at least a constant fraction of the corresponding complete bipartite subgraphs. Are there efficient algorithms to detect them? And can these algorithms be implemented efficiently if only a small part of the graph fits in main memory?"

Theoretical results. From a theoretical point of view, the dense k-subgraph problem, i.e. finding the densest subgraph with k vertices in a given graph, is clearly NP-hard (this is easy to see by a reduction from the max-clique problem). Some approximation algorithms with a non-constant approximation factor can be found in the literature, for example in [24, 14, 13], none of which seems to be of practical applicability. Studies on the inherent complexity of obtaining a constant-factor approximation algorithm are reported in [25] and [12].
General Terms
Algorithms, Experimentation

Work partially supported by the EU Research and Training Network COMBSTRU (HPRN-CT-2002-00278) and by the Italian Registry of ccTLD .it. Also with the Dipartimento di Ingegneria dell'Informazione, Università di Siena, Italy.
Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.
2. PREVIOUS WORK
Given the hypertext nature of the WWW, one can approach the problem of finding cyber-communities by using as the main source the textual content of the web pages, the hyper-link structure, or both. Among the methods for finding groups of coherent pages based only on text content we can mention [8]. Recommendation systems usually collect information on social networks from a variety of sources, not only link structure (e.g. [29]). Problems of a similar nature appear in the areas of social network analysis, citation analysis and bibliometrics, where however, given the relatively smaller data sets involved (relative to the WWW), efficiency is often not a critical issue [35]. Since the pioneering work [19], the prevailing trend in the Computer Science community has been to use mainly the link structure as the basis of the computation. Previous literature on the problem of finding cyber-communities using link-based analysis of the web-graph can be broadly split into two large groups. In the first group are methods that need an initial seed of a community to start the process of community identification. Assuming the availability of a seed for a possible community naturally directs the computational effort to the region of the web-graph closest to the seed and suggests the use of sophisticated but computationally intensive techniques, usually based on max-flow/min-cut approaches. In this category we can list the work of [19, 15, 16, 27, 28]. The second group of algorithms does not assume any seed and aims at finding all (or most) of the communities by exploring the whole web-graph.
3. PRELIMINARIES
We start with the case of an isolated complete bipartite graph $(X, Y)$. Consider a node $u \in X$: clearly $N^+(u) = Y$, for every $y \in N^+(u)$ we have $N^-(y) = X$, and thus for every $w \in N^-(y)$, $N^+(w) = Y$. Turning to the cardinalities: for a node $u \in X$, every $y \in N^+(u)$ and every $w \in N^-(y)$ satisfy $d^+(w) = |Y|$, so the average out-degree of the nodes in $N^-(y)$ is also $|Y|$. In formulae: given $u \in X$ and $y \in N^+(u)$,

$$\frac{1}{d^-(y)} \sum_{w \in N^-(y)} d^+(w) = |Y|.$$

Next we average over all $y \in N^+(u)$ and obtain the following equation: given $u \in X$,

$$\frac{1}{\sum_{y \in N^+(u)} d^-(y)} \; \sum_{y \in N^+(u)} \sum_{w \in N^-(y)} d^+(w) = |Y| = d^+(u),$$

where the last equality holds because in the complete isolated case $d^+(u) = |Y|$.
For $\gamma \to 1$ the difference tends to zero. Finally, assuming that for a $\gamma$-dense bipartite subgraph of $G$ the excess sets $\tilde N(u) \setminus X$ and $N^+(\tilde N(u)) \setminus Y$ give only a small contribution, we can still use the above test as evidence of the presence of a dense subgraph. At this point we pause, state our first criterion, and subject it to criticism in order to improve it.

Criterion 1. If $d^+(u)$ and $|\tilde N(u)|$ are big enough and

$$d^+(u) \approx \frac{1}{|\tilde N(u)|} \sum_{v \in \tilde N(u)} d^+(v),$$

then $\tilde N(u)$ and $N^+(\tilde N(u))$ might contain a community.
Next we see how to transform the above equality for isolated $\gamma$-dense graphs. Consider a node $u \in X$: now $N^+(u) \subseteq Y$ and, for a node $v \in Y$, $N^-(v) \subseteq X$. Thus we get the bounds

$$\gamma^2\,|X||Y| \;\le\; \sum_{y \in N^+(u)} d^-(y) \;\le\; |X||Y|,$$

$$\gamma^3\,|X||Y|^2 \;\le\; \sum_{y \in N^+(u)} \sum_{w \in N^-(y)} d^+(w) \;\le\; |X||Y|^2.$$

Thus the ratio of the two quantities lies in the range $[\gamma^3 |Y|,\; |Y|/\gamma^2]$. On the other hand $\gamma|Y| \le d^+(u) \le |Y|$, so $d^+(u)$ lies in the same range, and the difference of the two terms is bounded by $|Y|\,(1/\gamma^2 - \gamma^3)$, which is at most $d^+(u)\,(1/\gamma^3 - \gamma^2)$. Again, for $\gamma \to 1$ the difference tends to zero. Thus, in an approximate sense, the relationship is preserved for isolated $\gamma$-dense bipartite graphs. Clearly we now make a further relaxation by considering the sets $N^+(\cdot)$ and $N^-(\cdot)$ as referred to the overall graph $G$, instead of just the isolated pair $(X, Y)$.

Criterion 2. If $d^+(u)$ and $|\tilde N(u)|$ are big enough and

$$d^+(u) \approx \frac{1}{\sum_{y \in N^+(u)} d^-(y)} \; \sum_{y \in N^+(u)} \sum_{w \in N^-(y)} d^+(w),$$

then $\tilde N(u)$ and $N^+(\tilde N(u))$ might contain a community.
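As a quick sanity check of the identity behind Criterion 2, the short Python script below (an illustrative sketch of ours, not the code used for the experiments; the graph representation and all names are our own) builds an isolated complete bipartite graph and verifies that the weighted average of the out-degrees equals $d^+(u) = |Y|$ exactly; on a $\gamma$-dense subgraph embedded in a larger graph the equality only holds approximately, which is what the criterion exploits.

from collections import defaultdict

# Isolated complete bipartite graph: every fan in X points to every center in Y.
X = [f"fan{i}" for i in range(5)]
Y = [f"center{j}" for j in range(7)]
out = defaultdict(set)   # out[u] = N+(u)
inc = defaultdict(set)   # inc[u] = N-(u)
for u in X:
    for y in Y:
        out[u].add(y)
        inc[y].add(u)

def dplus(v):
    return len(out[v])

def dminus(v):
    return len(inc[v])

u = X[0]
# Criterion 2: average of d+(w) over all w reached by one step forward and one step back
# from u, weighted by the in-degrees of the intermediate centers.
num = sum(dplus(w) for y in out[u] for w in inc[y])
den = sum(dminus(y) for y in out[u])
assert num / den == dplus(u) == len(Y)   # exact equality in the complete isolated case
print(num / den, dplus(u))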
In the refinement phase we remove from $\tilde N(u)$ all vertices $v$ for which $N^+(v) \cap N^+(\tilde N(u))$ is small, and we remove from $N^+(\tilde N(u))$ all vertices $w$ for which $N^-(w) \cap \tilde N(u)$ is small.
Algorithm RobustDensityEstimation
Input: A directed graph G = (V, E), a threshold for degrees
Result: A set S of dense subgraphs detected by vertices of out-degree > threshold
begin
    Init:
    forall u of G do
        forall v ∈ N−(u) do
            TabSum[u] ← TabSum[u] + d+(v)
        end
    end
    Search:
    forall u that is not already a fan of a community and s.t. d+(u) > threshold do
        sum ← 0; nb ← 0;
        forall v ∈ N+(u) do
            sum ← sum + TabSum[v];
            nb ← nb + d−(v);
        end
        if sum/nb ≈ d+(u) and nb > d+(u) · threshold then
            S ← S ∪ ExtractCommunity(u);
        end
    end
    Return S;
end

Figure 2: RobustDensityEstimation performs the main filtering step.
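For concreteness, the filtering pass of Figure 2 can be rendered in Python roughly as follows. This is an illustrative sketch, not the original implementation: the graph is held as in-core adjacency sets, extract_community stands in for the function of Figure 3, the tolerance tol used to interpret the approximate equality of Criterion 2 is our own choice, and the bookkeeping by which ExtractCommunity discounts the arcs of a found community from TabSum and the in-degrees is omitted here.

def robust_density_estimation(out_adj, in_adj, threshold, extract_community, tol=0.3):
    """out_adj[u] = N+(u) and in_adj[u] = N-(u), both as sets; returns (fans, centers) pairs."""
    dplus = {u: len(vs) for u, vs in out_adj.items()}
    dminus = {u: len(vs) for u, vs in in_adj.items()}
    # Init: TabSum[u] = sum of the out-degrees of the in-neighbours of u.
    tab_sum = {u: sum(dplus.get(v, 0) for v in in_adj.get(u, ())) for u in out_adj}
    already_fan, communities = set(), []
    # Search: test every candidate fan of large out-degree against Criterion 2.
    for u in out_adj:
        if u in already_fan or dplus[u] <= threshold:
            continue
        s = sum(tab_sum.get(v, 0) for v in out_adj[u])
        nb = sum(dminus.get(v, 0) for v in out_adj[u])
        if nb > dplus[u] * threshold and abs(s / nb - dplus[u]) <= tol * dplus[u]:
            fans, centers = extract_community(u)
            already_fan.update(fans)
            communities.append((fans, centers))
    return communities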
4.2 Algorithms
In Figures 2 and 3 we give the pseudo-code for our heuristic. Algorithm RobustDensityEstimation detects vertices that satisfy the filtering formula of Criterion 2; function ExtractCommunity then computes $\tilde N(u)$ and $N^+(\tilde N(u))$ and extracts the community of which $u$ is a fan. These two algorithms are a straightforward application of the formula in Criterion 2.
In either case we do not miss any important structure of our data. Observe that the last loop of function ExtractCommunity logically removes from the graph all arcs of the current community, but not the vertices. Moreover, a vertex can be a fan of a community and a center of several communities. In particular, it can be fan and center for the same community, so we are able to detect dense quasi-bipartite subgraphs as well as quasi-cliques.
Function ExtractCommunity
Input: A vertex u of a directed graph G = (V, E). Slackness parameter ε
Result: A community of which u is a fan
begin
    Initialization:
    forall v ∈ N+(u) do
        forall w ∈ N−(v) that is not already a fan of a community do
            if d+(w) > (1 − ε) d+(u) then mark w as potential fan
        end
    end
    forall potential fan v do
        forall w ∈ N+(v) do
            mark w as potential center;
        end
    end
    Iterative refinement:
    repeat
        Unmark potential fans of small local out-degree;
        Unmark potential centers of small local in-degree;
    until the numbers of potential fans and centers have not changed significantly
    Update global data structures:
    forall potential fan v do
        forall w ∈ N+(v) that is also a potential center do
            TabSum[w] ← TabSum[w] − d+(v);
            d−(w) ← d−(w) − 1;
        end
    end
    Return (potential fans, potential centers);
end

Figure 3: ExtractCommunity extracts the dense subgraph.
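Along the same lines, the initialization and refinement steps of Figure 3 could be sketched as below. Again this is our own illustrative rendering: the "small local degree" tests are expressed through an explicit fraction min_frac of our choosing, the loop runs to a fixed point instead of stopping when the change becomes insignificant, and the final update of the global data structures is not shown.

def extract_community(u, out_adj, in_adj, eps=0.3, min_frac=0.2, already_fan=frozenset()):
    """Return (fans, centers) of the community of which u is a fan (sketch of Figure 3)."""
    dplus_u = len(out_adj[u])
    # Initialization: potential fans share a center with u and have a comparable out-degree.
    fans = {w for v in out_adj[u] for w in in_adj.get(v, ())
            if w not in already_fan and len(out_adj.get(w, ())) > (1 - eps) * dplus_u}
    fans.add(u)
    centers = {c for f in fans for c in out_adj.get(f, ())}
    # Iterative refinement: drop fans and centers whose local degree is too small.
    while fans and centers:
        new_fans = {f for f in fans
                    if len(out_adj.get(f, set()) & centers) >= min_frac * len(centers)}
        new_centers = {c for c in centers
                       if len(in_adj.get(c, set()) & new_fans) >= min_frac * len(new_fans)}
        if new_fans == fans and new_centers == centers:
            break
        fans, centers = new_fans, new_centers
    return fans, centers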
4.5 Scalability
The algorithm we described, including the initial cleaning steps, can easily be converted to work in the streaming model, except for procedure ExtractCommunity, which seems to require random access to the data in core memory. Here we estimate, with a back-of-the-envelope calculation, the limits of this approach when using core memory. Andrei Broder et al. [6] in the year 2000 estimated the size of the indexable web graph at 200M pages and 1.5G edges (thus an average degree of about 7.5 links per page, which is consistent with the average degree 8.4 of the WebBase data of 2001). A more recent estimate by Gulli and Signorini [22] in 2005 gives a count of 11.5G pages. The latest index-size war ended with Google claiming an index of 25G pages. The average degree of the web graph has been increasing recently due to the dynamic generation of pages with high degree, and some measurements give a count of about 40 links per page. The initial cleaning phase reduces the WebBase graph by a factor 0.17 in the node count and 0.059 in the edge count. Applying these coefficients, the cleaned web graph might have 4.25G nodes and 59G arcs. The compression techniques in [5] achieve an overall performance of 3.08 bits/edge on the WebBase dataset. Applied to our cleaned web graph, this gives a total of about 22.5 GB to store the graph. Since we store both the graph G and its transpose, we need to double the storage (although some saving might be achieved here), arriving at an estimate of about 45 GB. With current technology this amount of core memory can certainly be provided by state-of-the-art multiprocessor mainframes (e.g., an IBM System Z9 sells in configurations ranging from 8 to 64 GB of RAM core memory).
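The arithmetic behind this estimate is easy to reproduce; the snippet below simply restates the numbers quoted above.

# Back-of-the-envelope memory estimate for a cleaned, compressed web graph.
pages = 25e9                          # pages claimed in the largest index
avg_degree = 40                       # recent estimates of out-links per page
node_keep, edge_keep = 0.17, 0.059    # reduction factors of the cleaning phase (WebBase)
bits_per_edge = 3.08                  # WebGraph compression rate reported in [5]

nodes = pages * node_keep                   # about 4.25e9 nodes
edges = pages * avg_degree * edge_keep      # about 59e9 arcs
gbytes = edges * bits_per_edge / 8 / 1e9    # about 22.7 GB for the graph alone
print(nodes, edges, gbytes, 2 * gbytes)     # doubled to hold the transpose too: about 45 GB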
5. TESTING EFFECTIVENESS
By construction, algorithms RobustDensityEstimation and ExtractCommunity return a list of dense subgraphs (where size and density are controlled by the parameters t and ε). Using standard terminology from Information Retrieval, we can say that full precision is guaranteed by default. In this section we estimate the recall properties of the proposed method. This task is complex since we have no efficient alternative method for obtaining a guaranteed ground truth. Therefore we proceed as follows. We add some arcs to the graph representing the Italian domain of the year 2004, so as to create new dense subgraphs. Afterwards, we observe how many of these new communities are detected by the algorithm, which is run blindly with respect to the artificially embedded communities. The number of edges added is of the order of only 50,000, so it is likely that the nature of a graph with 100M edges is not affected. In the first experiment, about detecting bipartite communities, we introduce 480 dense bipartite subgraphs. More precisely, we introduce 10 bipartite subgraphs for each of the 48 categories given by all possible combinations of number of fans, number of centers, and density: the number of fans is chosen in {10, 20, 40, 80}, the number of centers in {10, 20, 40, 80}, and the density randomly in the ranges [0.25, 0.5] (low), [0.5, 0.75] (medium), and [0.75, 1] (high). Moreover, the fans and centers of every new community are chosen so that they do not intersect any community found in the original graph nor any other new community. The following table (Table 1) shows how many added communities are detected; the maximum recall number per entry is 10.
Table 1: Number of added bipartite communities found with threshold = 8, depending on number of fans, centers, and density.

In the second experiment, about detecting cliques, we introduce ten cliques for each of the 12 classes given by all possible combinations of number of pages in {10, 20, 30, 40} and density randomly chosen in the ranges [0.25, 0.5], [0.5, 0.75], and [0.75, 1]. The following table (Table 2) shows how many such cliques are found, on average over 70 experiments. Again the maximum recall number per entry is 10.

# Pages   Low   Med   High
   10     0.0   0.1   3.5
   20     3.6   7.6   8.3
   30     8.5   9.4   9.3
   40     9.6   9.8   9.7
Table 2: Number of added clique communities found with threshold = 8, depending on number of pages and density.

The cleaned .it 2004 graph used for the test has an average degree of roughly 6 (see Section 6). A small bipartite graph of 10-by-10 nodes, or a small clique of 10 nodes at 50% density, has an average degree of 5. The breakdown of the degree-counting heuristic at these low values is easily explained by the fact that such small and sparse communities are effectively hard to distinguish from the background graph by simple degree counting.
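To make the planting procedure used in these experiments concrete, the sketch below (our own illustrative code with hypothetical helper names; the choice of fan and center vertices disjoint from existing communities, and the clique variant, are not shown) embeds one dense bipartite subgraph of a prescribed density into an existing graph.

import random

def plant_dense_bipartite(out_adj, in_adj, fans, centers, density):
    """Add fan->center arcs so that (fans, centers) forms a bipartite subgraph of the given density."""
    for f in fans:
        chosen = [c for c in centers if random.random() < density]
        if not chosen:                       # force a minimum out-degree of 1 for every fan
            chosen = [random.choice(centers)]
        for c in chosen:
            out_adj.setdefault(f, set()).add(c)
            in_adj.setdefault(c, set()).add(f)

# Example: a 20-by-40 community with density drawn from the "medium" range [0.5, 0.75].
# plant_dense_bipartite(out_adj, in_adj, my_fans, my_centers, random.uniform(0.5, 0.75))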
6.
In this section we apply our algorithm to the task of extracting and classifying large real communities in the web.
Table 6 shows how many communities are found with the threshold equal to 10 in the three data sets, as a function of the number of fans, centers, and density. Low, medium and high densities correspond respectively to the ranges [0.25, 0.5], [0.5, 0.75], and [0.75, 1].
Figure 5: Number of communities found by Algorithm RobustDensityEstimation as a function of the degree threshold. The gray scale denotes a partition of the communities by density.

                Web 2001                     Italy 2004           Uk 2005
Thresh.   # com.  # loops  Time         # com.  # loops      # com.  # loops
  10       5686    2.7     2h12min       1099    2.7          4220    2.5
  15       2412    2.8     1h03min        452    2.8          2024    2.6
  20       1103    2.8     31min          248    2.8          1204    2.7
  25        616    2.6     19min          153    2.8           767    2.7

Table 4: Measurements of performance. Number of communities found, total computing time, and average number of cleaning loops per community.
7. VISUALIZATION OF COMMUNITIES
The compressed data structure of [5] storing the web graph does not hold any information about the textual content of the pages. Therefore, once the list of URLs of fans and centers for each community has been created, a non-recursive crawl of the WWW focused on this list of URLs has been performed in order to recover textual data for the communities. What we want is an approximate description of the community topics. The intuition is that the topic of a community is well described by its centers. As a good summary of the content of a center page we extract the text contained in the title tag of the page. We treat fan pages in a different way. The full content of the page is probably not interesting, because a fan page can cover different topics, or might even be part of different communities. We extract only the anchor text of the links to center pages, because it is a good textual description of the edges from the fan to the centers in the community graph. For each community we build a weighted set of words from all the words extracted from its centers and fans. The weight of each word takes into account whether the word comes from a center and/or a fan and whether it is repeated. All the words in a stop-word list are removed.

We then build a flat clustering of the communities. For clustering we use the k-center algorithm described in [18, 17]. As a metric we adopt the generalized Jaccard distance (a weighted form of the standard Jaccard distance).

This paper focuses on the algorithmic principles and testing of a fast and effective heuristic for detecting large-to-medium size dense subgraphs in the web graph. The examples of clusters reported in this section are to be considered as anecdotal evidence of the capabilities of the Community Watch system. We plan to use the Community Watch tool for a full-scale analysis of portions of the web graph as future research. In Table 8 we show some high-quality clusters of communities found by the Community Watch tool in the data set UK2005, among those communities detected with threshold t = 25 (767 communities). Further filtering of communities with too few centers reduces the number of items (communities) to 636. The full listing can be inspected through the Community Watch web interface, publicly available at http://comwatch.iit.cnr.it.

Table 8: Some notable clusters of communities in the data set UK05 for t = 25. Parameters used for filtering and clustering: # fans = 0-1000, # centers = 10-max, average degree = 10-max, target = 70 clusters (55 done). Communities in the filtered data set: 636. We report, for each cluster, id number, keywords with weights, number of communities in the cluster, and how many of these are relevant to the prevalent type.
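For reference, on two weighted bags of words the generalized Jaccard distance used in the clustering step above is one minus the ratio between the sums of the coordinate-wise minima and maxima of the weights. A minimal sketch (our own; the actual weighting of center titles versus fan anchor text is not reproduced here) follows.

def generalized_jaccard_distance(a, b):
    """a, b: dicts mapping word -> non-negative weight; returns a value in [0, 1]."""
    words = set(a) | set(b)
    num = sum(min(a.get(w, 0.0), b.get(w, 0.0)) for w in words)
    den = sum(max(a.get(w, 0.0), b.get(w, 0.0)) for w in words)
    if den == 0:
        return 0.0            # two empty descriptions are considered identical
    return 1.0 - num / den

# Example: two community descriptions as weighted bags of words.
c1 = {"football": 3.0, "league": 1.5, "club": 1.0}
c2 = {"football": 2.0, "club": 2.0, "tickets": 0.5}
print(generalized_jaccard_distance(c1, c2))   # about 0.57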
[Table 6 body not reproduced here: for each data set (Web 2001, Italy 2004, United Kingdom 2005) it reports the number of communities detected at t = 10, broken down by ranges of the number of fans, ranges of the number of centers, and density (low, med, high).]
Table 6: Distribution of the detected communities depending on number of fans, centers, and density, for t = 10.

                 Web 2001                       Italy 2004                       Uk 2005
Thresh.   # Total    # in Com.  Perc.   # Total     # in Com.  Perc.    # Total     # in Com.  Perc.
  10       984,290    581,828    59%    3,331,358   3,031,723   91%     4,085,309   3,744,159   92%
  15       550,206    286,629    52%    2,225,414   2,009,107   90%     3,476,321   3,172,338   91%
  20       354,971    164,501    46%    1,761,160     642,960   37%     2,923,794   2,752,726   94%
  25       244,751    105,500    43%      487,866     284,218   58%     2,652,204   2,503,226   94%

Table 7: Coverage of communities found in the web graphs. The leftmost column shows the threshold value. For each data set, the first column is the number of pages with d+ > t, and the second and third columns are the number and percentage of pages that have been found to be a fan of some community.
9. REFERENCES
[1] J. Abello, M. G. C. Resende, and S. Sudarsky. Massive quasi-clique detection. In Latin American Theoretical Informatics (LATIN), pages 598-612, 2002.
[2] K. Bharat, A. Z. Broder, J. Dean, and M. R. Henzinger. A comparison of techniques to find mirrored hosts on the WWW. Journal of the American Society of Information Science, 51(12):1114-1122, 2000.
[3] M. Bianchini, M. Gori, and F. Scarselli. Inside PageRank. ACM Trans. Inter. Tech., 5(1):92-128, 2005.
[4] P. Boldi, B. Codenotti, M. Santini, and S. Vigna. UbiCrawler: A scalable fully distributed web crawler. Software: Practice and Experience, 34(8):711-726, 2004.
[5] P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In WWW '04, pages 595-601, 2004.
[6] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. Computer Networks, 33(1-6):309-320, 2000.
[7] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630-659, 2000.
[8] A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Selected papers from the Sixth International Conference on World Wide Web, pages 1157-1166, Essex, UK, 1997. Elsevier Science Publishers Ltd.
[9] A. Capocci, V. D. P. Servedio, G. Caldarelli, and F. Colaiori. Communities detection in large networks. In WAW 2004: Algorithms and Models for the Web-Graph: Third International Workshop, pages 181-188, 2004.
[10] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. Kleinberg. Mining the link structure of the World Wide Web. Computer, 32(8):60-67, 1999.
[11] J. Cho and H. Garcia-Molina. WebBase and the Stanford InterLib project. In 2000 Kyoto International Conference on Digital Libraries: Research and Practice, 2000.
[12] U. Feige. Relations between average case complexity and approximation complexity. In Proc. of STOC 2002, Montreal, 2002.
[13] U. Feige and M. Langberg. Approximation algorithms for maximization problems arising in graph partitioning. Journal of Algorithms, 41:174-211, 2001.
[14] U. Feige, D. Peleg, and G. Kortsarz. The dense k-subgraph problem. Algorithmica, 29(3):410-421, 2001.
[15] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In KDD '00, pages 150-160, New York, NY, USA, 2000. ACM Press.
[16] G. W. Flake, S. Lawrence, C. L. Giles, and F. Coetzee. Self-organization of the web and identification of communities. IEEE Computer, 35(3):66-71, 2002.
[17] F. Geraci, M. Maggini, M. Pellegrini, and F. Sebastiani. Cluster generation and cluster labelling for web snippets. In SPIRE 2006, pages 25-36, Glasgow, UK, October 2006. Volume 4209 of LNCS.
[18] F. Geraci, M. Pellegrini, P. Pisati, and F. Sebastiani. A scalable algorithm for high-quality clustering of web snippets. In Proceedings of the 21st Annual ACM Symposium on Applied Computing (SAC 2006), pages 1058-1062, Dijon, France, April 2006.
[19] D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In HYPERTEXT '98, pages 225-234, New York, NY, USA, 1998. ACM Press.
[20] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB '05, pages 721-732. VLDB Endowment, 2005.
[21] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA, pages 7821-7826, 2002.
[22] A. Gulli and A. Signorini. The indexable web is more than 11.5 billion pages. In WWW (Special Interest Tracks and Posters), pages 902-903, 2005.
[23] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In First International Workshop on Adversarial Information Retrieval on the Web, 2005.
[24] Q. Han, Y. Ye, H. Zhang, and J. Zhang. Approximation of dense k-subgraph, 2000. Manuscript.
[25] J. Håstad. Clique is hard to approximate within n^(1-epsilon). Acta Mathematica, 182:105-142, 1999.
[26] M. Henzinger. Algorithmic challenges in web search engines. Internet Mathematics, 1(1):115-126, 2002.
[27] N. Imafuji and M. Kitsuregawa. Finding a web community by maximum flow algorithm with HITS score based capacity. In DASFAA 2003, pages 101-106, 2003.
[28] H. Ino, M. Kudo, and A. Nakamura. Partitioning of web graphs by community topology. In WWW '05, pages 661-669, New York, NY, USA, 2005. ACM Press.
[29] H. Kautz, B. Selman, and M. Shah. Referral Web: Combining social networks and collaborative filtering. Communications of the ACM, 40(3):63-65, 1997.
[30] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting large-scale knowledge bases from the web. In VLDB '99, pages 639-650, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
[31] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. Computer Networks (Amsterdam, Netherlands: 1999), 31(11-16):1481-1493, 1999.
[32] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Method and system for trawling the World-Wide Web to identify implicitly-defined communities of web pages. US Patent 6886129, 2005.
[33] S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting large-scale knowledge bases from the web. In The VLDB Journal, pages 639-650, 1999.
[34] R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks (Amsterdam, Netherlands: 1999), 33(1-6):387-401, 2000.
[35] M. Newman. The structure and function of complex networks. SIAM Review, 45(2):167-256, 2003.
[36] P. K. Reddy and M. Kitsuregawa. An approach to relate the web communities through bipartite graphs. In WISE 2001, pages 301-310, 2001.
[37] B. Wu and B. D. Davison. Identifying link farm spam pages. In WWW '05, pages 820-829, New York, NY, USA, 2005. ACM Press.