Towards Data Mining in Large and Fully Distributed Peer To Peer Overlay Networks
Abstract
The Internet, which is becoming an increasingly dynamic and extremely heterogeneous network, has
recently become a platform for huge, fully distributed peer-to-peer overlay networks containing millions
of nodes, typically for the purpose of information dissemination and file sharing. This paper targets the
problem of analyzing data which are scattered over such a huge and dynamic set of nodes, where each
node is storing possibly very little data but where the total amount of data is immense due to the large
number of nodes. We present distributed algorithms for effectively calculating basic statistics of data
using the recently introduced newscast model of computation and we demonstrate how to implement
basic data mining algorithms based on these techniques. We will argue that the suggested techniques are
efficient, robust and scalable and that they preserve the privacy of data.
1 Introduction
With the rapid increase in the number of computers connected to the Internet and the emergence of a range
of mobile computational devices which might soon be equipped with mobile IP technology, the Internet
is converging to a more dynamic, huge, extremely heterogeneous network which nevertheless provides
basic services such as routing and name lookup. This platform is already being used to support huge,
fully distributed peer-to-peer overlay networks containing millions of nodes typically for the purpose of
information dissemination and file sharing [8]. Such fully distributed systems generate immense amounts
of data. Analyzing this data can be interesting for both scientific and business purposes. Among other
applications, this environment is a natural target for distributed data mining [10].
In this paper we would like to push the concept of distributed data mining to the extreme. The
motivations behind distributed data mining include the optimal usage of available computational resources,
privacy and dependability by eliminating critical points of service. We will adopt the harshest possible
constraints on the distribution of data and the elements of the network and demonstrate techniques which
can still provide useful information about the distributed data effectively and dependably.
There are two constraints that we will adopt. The first is that each node is allowed to hold as little as
a single data instance. This can be viewed as an extremum of horizontal data distribution. The second
is another extremum: there is practically no limit on the number of nodes. The only requirement is that, in
principle, each pair of nodes can communicate directly, which holds if the nodes are on the Internet with
a (not necessarily fixed) IP address.
Furthermore, we will concentrate on two other very important aspects. The first is data privacy, the
second is the dynamic nature of the underlying network: nodes can leave the overlay network and new
nodes can join it.
To achieve our goal we will work in the newscast model of computation [5]. This model is built on a
lower layer, an epidemic protocol for disseminating information and group membership [4], and it provides
a simple interface for applications. The advantage of the model is that due to the robustness and scalability
of the epidemic protocol it is built on, the applications of the newscast model of computation inherit this
robustness and scalability and can target the kinds of distributed networks described above.
∗ In the Proc. of BNAIC’03, pp. 203–210, Nijmegen, The Netherlands, 2003.
Using the same argument as above we can show that the SA algorithm reduces the variance of the input
data exponentially fast. Moreover, the system reacts to changes in the input data within d iterations.
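As an illustration of the kind of variance reduction claimed here, the following is a minimal sketch of generic pairwise gossip averaging. It assumes that in each iteration every node contacts one uniformly random peer and that both nodes replace their values with the mean of the pair. It is not the SA algorithm itself, whose definition is not reproduced in this section; the function name gossip_average and its parameters are hypothetical.

```python
import random
import statistics


def gossip_average(values, rounds, seed=0):
    """Pairwise gossip averaging (illustrative assumption, not the paper's SA).

    Returns the final node values and the population variance recorded
    before the first round and after every subsequent round.
    """
    rng = random.Random(seed)
    x = list(values)
    n = len(x)
    history = [statistics.pvariance(x)]
    for _ in range(rounds):
        # One round: every node, visited in random order, contacts one peer.
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            # Choose a peer uniformly at random among the other n - 1 nodes.
            j = rng.randrange(n - 1)
            if j >= i:
                j += 1
            # Both nodes adopt the mean of their two current values.
            mean = (x[i] + x[j]) / 2.0
            x[i] = x[j] = mean
        history.append(statistics.pvariance(x))
    return x, history


if __name__ == "__main__":
    data_rng = random.Random(1)
    data = [data_rng.uniform(0, 100) for _ in range(200)]
    final, history = gossip_average(data, rounds=20, seed=42)
    for round_no, var in enumerate(history):
        print(round_no, var)
```

Each pairwise exchange preserves the sum of the two values and cannot increase the variance of the whole set, so the mean held by the network stays fixed while the recorded variances shrink from round to round.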
References
[1] A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and
D. Terry. Epidemic algorithms for replicated database management. In Proceedings of the 6th Annual
ACM Symposium on Principles of Distributed Computing (PODC’87), pages 1–12, Vancouver, Aug.
1987. ACM.
[2] P. T. Eugster, R. Guerraoui, S. B. Handurukande, A.-M. Kermarrec, and P. Kouznetsov. Lightweight
probabilistic broadcast. In Proceedings of the International Conference on Dependable Systems and
Networks (DSN’01), Göteborg, Sweden, 2001.
[3] D. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. The MIT Press, Cambridge,
Massachusetts, London, England, 2001.
[4] M. Jelasity, M. Preuß, M. van Steen, and B. Paechter. Maintaining connectivity in a scalable and
robust distributed environment. In H. E. Bal, K.-P. Löhr, and A. Reinefeld, editors, Proceedings of
the Second IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid2002),
pages 389–394, Berlin, Germany, 2002. IEEE, IEEE Computer Society.
[5] M. Jelasity and M. van Steen. Large-scale newscast computing on the Internet. Technical Report
IR-503, Vrije Universiteit Amsterdam, Department of Computer Science, Amsterdam, The Netherlands,
Oct. 2002. http://www.cs.vu.nl/globe/techreps.html.
[6] A.-M. Kermarrec, L. Massoulié, and A. J. Ganesh. Probabilistic reliable dissemination in large-scale
systems. IEEE Transactions on Parallel and Distributed Systems, 2003. To appear.
[7] W. Kowalczyk, M. Jelasity, and A. Eiben. Towards data mining in large and fully distributed
peer-to-peer overlay networks. Technical Report IR-AI-003, Vrije Universiteit Amsterdam, Department
of Computer Science, Amsterdam, The Netherlands, May 2003.
[8] D. S. Milojicic, V. Kalogeraki, R. Lukose, K. Nagaraja, J. Pruyne, B. Richard, S. Rollins, and Z. Xu.
Peer-to-peer computing. Technical Report HPL-2002-57, HP Laboratories Palo Alto, 2002.
[9] B. Paechter, T. Bäck, M. Schoenauer, M. Sebag, A. E. Eiben, J. J. Merelo, and T. C. Fogarty. A
distributed resource evolutionary algorithm machine (DREAM). In Proceedings of the 2000 Congress
on Evolutionary Computation (CEC 2000), pages 951–958. IEEE, IEEE Press, 2000.
[10] B.-H. Park and H. Kargupta. Distributed data mining: Algorithms, systems, and applications. In
N. Ye, editor, The Handbook of Data Mining. Lawrence Erlbaum Associates, Inc., 2003.