Distributed Data Mining
Grigorios Tsoumakas*
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, 54124
Greece
voice: +30 2310-998418
fax: +30 2310-998419
email: [email protected]
Ioannis Vlahavas
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, 54124
Greece
voice: +30 2310-998418
fax: +30 2310-998419
email: [email protected]
(* Corresponding author)
Distributed Data Mining
INTRODUCTION
Modern distributed computing environments comprise several different sources of large volumes of data as well as several computing units. The most prominent example is the Internet, where increasingly more databases and data streams appear that deal with a variety of domains. Other examples of distributed environments that have been developed in the last few years are sensor networks for process monitoring and grids, where a large number of computing and storage units are interconnected over a high-speed network.
The classical knowledge discovery process would require the collection of the distributed data in a data warehouse for central processing. However, this is usually either ineffective or infeasible for the following reasons:
(1) Storage cost. It is obvious that the storage requirements of a central system would be enormous. A classical example concerns data from astronomy, and especially images from earth and space telescopes. The size of such databases is reaching the scale of exabytes (10^18 bytes) and is increasing at a high pace. The central storage of the data of all telescopes of the planet would require a data warehouse of enormous capacity and cost.
(2) Communication cost. The transfer of huge data volumes over a network might take an extremely long time and also incur an unbearable financial cost. Even a small volume of data might create problems in wireless network environments with limited bandwidth. Note also that communication may be a continuous overhead, as distributed databases are not always constant and unchangeable. On the contrary, it is common to have databases that are frequently updated with new data or data streams that constantly record information (e.g., remote sensing, sports statistics, etc.).
(3) Computational cost. The computational cost of mining a central data warehouse is much bigger than the sum of the costs of analyzing smaller parts of the data, which could also be done in parallel. In a grid, for example, it might be easier to gather the data at a central location, but a distributed approach makes much better use of the available computing resources.
(4) Private and sensitive data. There are many popular data mining applications that deal with sensitive data, such as people's medical and financial records. The central collection of such data is not desirable, as it puts their privacy at risk. In certain cases (e.g., banking, telecommunications) the data might belong to different, perhaps competing, organizations that want to exchange knowledge without exchanging the raw private data. Such concerns gave rise to the development of methods and systems that deal with the above issues in order to discover knowledge from distributed data.
BACKGROUND

Distributed Data Mining (DDM) (Fu, 2001; Park & Kargupta, 2003) is concerned with the application of the classical Data Mining procedure in a distributed computing environment, trying to make the best use of the available resources (communication network, computing units and databases). Data Mining takes place both locally at each distributed site and at a global level where the local knowledge is fused. The first phase normally involves the analysis of the local database at each distributed site.
Then, the discovered knowledge is usually transmitted to a merger site, where the
integration of the distributed local models is performed. The results are transmitted
back to the distributed databases, so that all sites become updated with the global
knowledge. In some approaches, instead of a merger site, the local models are broadcast to all other sites, so that each site can compute the global model in parallel. This generic scheme is depicted in Figure 1.
Figure 1. The typical DDM architecture: (1) local mining at each distributed database (DB 1 to DB N); (2) transmission of local statistics and/or local models to a merger site; (3) global mining at the merger site; (4) transmission of the global statistics and/or model back to the local databases.
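As an illustration, the following minimal Python sketch simulates the four steps of Figure 1 over a list of local databases. The functions local_mine and merge_models are hypothetical placeholders (here they merely count and sum class labels); any concrete DDM algorithm would instantiate them differently.

    # Minimal sketch of the generic DDM scheme of Figure 1.
    # local_mine, merge_models and the data sets are hypothetical
    # placeholders; any concrete DDM algorithm instantiates them.

    def local_mine(db):
        # Step 1: analyze a local database; here, just count records per class.
        counts = {}
        for record in db:
            label = record["class"]
            counts[label] = counts.get(label, 0) + 1
        return counts

    def merge_models(local_models):
        # Step 3: global mining at the merger site; here, sum the local counts.
        global_model = {}
        for model in local_models:
            for label, count in model.items():
                global_model[label] = global_model.get(label, 0) + count
        return global_model

    databases = [
        [{"class": "a"}, {"class": "b"}],   # DB 1
        [{"class": "a"}, {"class": "a"}],   # DB 2
    ]
    local_models = [local_mine(db) for db in databases]  # (1) local mining
    global_model = merge_models(local_models)            # (2)+(3) transmit and merge
    sites = [global_model for _ in databases]            # (4) send global model back
    print(global_model)                                  # {'a': 3, 'b': 1}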
Distributed databases may contain data with homogeneous or heterogeneous schemata. In the former case, the attributes describing the data are the same in each distributed database. This is often the case when the databases belong to the same organization (e.g., local stores of a chain). In the latter case, the attributes differ among the distributed databases. In certain applications, a common key attribute exists among the heterogeneous databases, which allows the association between tuples. In other applications, the target attribute for prediction might be common across all distributed databases.
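The following small Python sketch contrasts the two cases; the records and attribute names are toy placeholders.

    # Homogeneous schemata: the same attributes at every site.
    homogeneous_db1 = [{"id": 1, "age": 34, "income": 50}]
    homogeneous_db2 = [{"id": 7, "age": 41, "income": 80}]  # same schema

    # Heterogeneous schemata: different attributes, shared key attribute "id".
    heterogeneous_db1 = [{"id": 1, "age": 34}]               # demographic site
    heterogeneous_db2 = [{"id": 1, "diagnosis": "flu"}]      # medical site

    # The common key allows the association of tuples across sites:
    joined = {r["id"]: {**r, **s}
              for r in heterogeneous_db1
              for s in heterogeneous_db2 if s["id"] == r["id"]}
    print(joined)  # {1: {'id': 1, 'age': 34, 'diagnosis': 'flu'}}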
MAIN FOCUS
Distributed Classification and Regression
Several approaches to learning classification and regression models from distributed data stem from methods that appear in the area of ensemble methods, such as Stacking and Boosting, while others extend existing approaches in order to minimize the communication and computational costs of distributed processing.
Chan and Stolfo (1993) applied the idea of Stacked Generalization (Wolpert, 1992), which they call meta-learning, to distributed data sets and investigated various schemes for structuring the meta-level training examples. They showed that meta-learning exhibits better performance with respect to majority voting for a number of domains. Knowledge Probing (Guo & Sutiwaraphun, 1999) builds on meta-learning and, in addition, uses an independent data set, called the probing set, in order to discover a comprehensible model. The output of a meta-learning system on this independent data set together with the attribute value vector of the same data set are used as training examples for a learning algorithm that produces the final comprehensible model. Collective Data Mining (Kargupta, Park, Hershberger & Johnson, 2000) allows the learning of classification and regression models over heterogeneous data sets. It is based on the observation that any function can be expressed in a distributed fashion using an appropriate set of basis functions. It moves an appropriately chosen sample of each data set to a single site and generates there the cross terms of the canonical representation of the global model.
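To make the meta-learning scheme concrete, here is a minimal sketch assuming scikit-learn is available; the random data, the shared validation set and the choice of base and meta-level learners are placeholder assumptions, not part of the original methods.

    # Minimal sketch of meta-learning (stacking) over distributed data,
    # assuming scikit-learn; the data sets are random placeholders.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    sites = [(rng.normal(size=(100, 5)), rng.integers(0, 2, 100)) for _ in range(3)]
    X_val, y_val = rng.normal(size=(50, 5)), rng.integers(0, 2, 50)

    # (1) Local mining: each site trains a base classifier on its own data.
    base_models = [DecisionTreeClassifier().fit(X, y) for X, y in sites]

    # (2) Meta-level examples: base-model predictions on a common validation set.
    meta_X = np.column_stack([m.predict(X_val) for m in base_models])

    # (3) Global mining: the merger site trains the meta-level model.
    meta_model = LogisticRegression().fit(meta_X, y_val)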
A number of approaches have been presented for learning a single rule set
from distributed data. Hall, Chawla and Bowyer (1997; 1998) present an approach
that involves learning decision trees in parallel from disjoint data, converting trees to
rules and then combining the rules into a single rule set. Hall, Chawla, Bowyer and
Kegelmeyer (2000) present a similar approach for the same case, with the difference
that rule learning algorithms are used locally. In both approaches, the rule
combination step starts by taking the union of the distributed rule sets and continues
by resolving any conflicts that arise. Cho and Wüthrich (2002) present a different
approach that starts by learning a single rule for each class from each distributed site.
Subsequently, the rules of each class are sorted according to a criterion that is a
combination of confidence, support and deviation, and finally the top k rules are
selected to form the final rule set. Conflicts that appear during the classification of
new instances are resolved using the technique of relative deviation (Wüthrich, 1997).
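The following sketch illustrates rule combination in the spirit of Cho and Wüthrich (2002): take the union of the distributed rule sets, score the rules, and keep the top k per class. The scoring function used here (confidence times support) is a simplified stand-in for their combination of confidence, support and deviation.

    # Minimal sketch of combining distributed rule sets; the scoring
    # function is a simplified stand-in for the actual criterion.

    def select_rules(rule_sets, k=2):
        # Take the union of the rules learned at all sites...
        all_rules = [rule for site_rules in rule_sets for rule in site_rules]
        # ...score each rule and keep the k best per class.
        all_rules.sort(key=lambda r: r["confidence"] * r["support"], reverse=True)
        selected = {}
        for rule in all_rules:
            selected.setdefault(rule["class"], [])
            if len(selected[rule["class"]]) < k:
                selected[rule["class"]].append(rule)
        return selected

    site1 = [{"class": "yes", "confidence": 0.9, "support": 0.2},
             {"class": "no", "confidence": 0.7, "support": 0.4}]
    site2 = [{"class": "yes", "confidence": 0.8, "support": 0.5}]
    print(select_rules([site1, site2]))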
Fan, Stolfo and Zhang (1999) adapted the generalized AdaBoost learning algorithm (Schapire and Singer, 1999) for DDM. At each round of the algorithm, a different site takes the role of training a weak model using its local data. Then, the update coefficient αt is computed based on the examples of all distributed sites and the weights of all examples are updated. Experimental results show that the resulting method is better than learning a single classifier from the union of the distributed data sets, but only in certain cases comparable to boosting that single classifier. The distributed boosting
algorithm of Lazarevic and Obradovic (2001) at each round learns a weak model in
each distributed site in parallel. These models are exchanged among the sites in order
to form an ensemble, which takes the role of the hypothesis. Then, the local weight
vectors are updated at each site and their sums are broadcasted to all distributed sites.
This way each distributed site maintains a local version of the global distribution
without the need of exchanging the complete weight vectors. Experimental results show that this method is comparable to, and in some cases even slightly better than, boosting on the union of the distributed data sets.
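The following sketch, loosely following Lazarevic and Obradovic (2001), shows the weight bookkeeping of distributed boosting: local weak models are exchanged to form the ensemble hypothesis, weights are updated locally, and only the weight sums are broadcast for global normalization. The data, the weak learner, and the per-site computation of the update coefficient are simplifying assumptions; several details of the actual algorithm are omitted.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(1)
    sites = [(rng.normal(size=(80, 4)), rng.integers(0, 2, 80)) for _ in range(3)]
    weights = [np.full(len(y), 1.0 / (80 * len(sites))) for _, y in sites]

    for t in range(5):  # boosting rounds
        # Each site learns a weak model on its local data in parallel.
        weak = [DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
                for (X, y), w in zip(sites, weights)]
        for i, (X, y) in enumerate(sites):
            # The exchanged weak models form the ensemble hypothesis.
            votes = np.mean([m.predict(X) for m in weak], axis=0)
            h = (votes >= 0.5).astype(int)
            # Simplification: the error and coefficient are computed per site.
            err = np.clip((weights[i] * (h != y)).sum() / weights[i].sum(),
                          1e-9, 1 - 1e-9)
            alpha = 0.5 * np.log((1 - err) / err)
            weights[i] *= np.exp(alpha * (h != y))  # AdaBoost-style update
        # Only the local weight sums need be broadcast to normalize globally.
        total = sum(w.sum() for w in weights)
        weights = [w / total for w in weights]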
Distributed Association Rule Mining

Agrawal and Shafer (1996) discuss three parallel algorithms for mining association rules. One of those, the Count Distribution (CD) algorithm, focuses on minimizing the communication cost, and is therefore suitable for mining association rules in a distributed environment. CD runs the Apriori algorithm (Agrawal and Srikant, 1994) locally at each data site. In each pass k of the algorithm,
each site generates the same candidate k-itemsets based on the globally frequent
itemsets of the previous phase. Then, each site calculates the local support counts of
the candidate itemsets and broadcasts them to the rest of the sites, so that global
support counts can be computed at each site. Subsequently, each site computes the k-
frequent itemsets based on the global counts of the candidate itemsets. The main disadvantage of CD is the synchronization step at each pass, when each site waits to receive the local support counts from the rest of the sites.
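The sketch below illustrates one pass of the count-exchange idea behind CD: every site counts the same shared candidate itemsets locally, and only the counts (never the transactions) are exchanged. The toy transactions and the fixed minimum support are placeholder assumptions.

    from itertools import combinations

    sites = [
        [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}],   # transactions at site 1
        [{"b", "c"}, {"a", "b"}],                    # transactions at site 2
    ]
    items = {"a", "b", "c"}
    candidates = [frozenset(c) for c in combinations(sorted(items), 2)]

    # Each site computes local support counts for the shared candidates.
    local_counts = [
        {c: sum(c <= t for t in transactions) for c in candidates}
        for transactions in sites
    ]
    # Broadcasting the counts lets every site derive the global counts.
    global_counts = {c: sum(lc[c] for lc in local_counts) for c in candidates}
    min_support = 2
    frequent = [c for c, n in global_counts.items() if n >= min_support]
    print(sorted(tuple(sorted(c)) for c in frequent))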
An alternative is the Distributed Mining of Association rules (DMA) algorithm (Cheung, Ng, Fu & Fu, 1996), which is also
found as Fast Distributed Mining of association rules (FDM) algorithm in (Cheung,
Han, Ng, Fu & Fu, 1996). DMA generates a smaller number of candidate itemsets
than CD, by pruning at each site the itemsets that are not locally frequent. In addition,
it uses polling sites to optimize the exchange of support counts among sites, reducing the communication cost. However, the performance enhancements of DMA over CD are based on the assumption that the data distributions at the different sites are skewed. When this assumption is violated, DMA actually introduces a larger communication cost than CD. The Optimized Distributed Association rule Mining (ODAM) algorithm (Ashrafi, Taniar & Smith, 2004) follows the paradigm of CD and DMA, but attempts to minimize both the communication and the computational cost. At the mining level, it proposes a technical extension to the Apriori algorithm. It reduces the
size of transactions by: i) deleting the items that weren’t found frequent in the
previous step and ii) deleting duplicate transactions, but keeping track of them
through a counter. It then attempts to fit the remaining transactions into main memory in order to avoid disk access costs. At the communication level, it minimizes the total message exchange by sending the local support counts of candidate itemsets to a single site, called the receiver. The receiver broadcasts the globally frequent itemsets back to the distributed sites.
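A minimal sketch of ODAM's transaction-reduction step follows: items found infrequent in the previous pass are removed, and duplicate transactions are collapsed into a counter. The transactions and the set of frequent items are toy placeholders.

    from collections import Counter

    transactions = [("a", "b", "x"), ("a", "b"), ("a", "b", "x"), ("c",)]
    frequent_items = {"a", "b"}  # items found frequent in the previous pass

    # Drop infrequent items, then collapse duplicates into a counter.
    reduced = Counter(
        tuple(sorted(item for item in t if item in frequent_items))
        for t in transactions
    )
    # Empty transactions carry no information for the next pass.
    reduced.pop((), None)
    print(reduced)  # Counter({('a', 'b'): 3})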
Distributed Clustering
Johnson and Kargupta (1999) present the Collective Hierarchical Clustering (CHC) algorithm for clustering distributed heterogeneous data sets that share a common key attribute. CHC comprises three stages: i) local hierarchical clustering at each site; ii) transmission of the local dendrograms to a facilitator site; and iii) generation of a global dendrogram. CHC estimates a lower and an upper bound for the distance between any two given data points, based on the information of the local dendrograms. It then clusters the data points using a function of these bounds (e.g., their average) as a distance metric. The resulting global dendrogram is an approximation of the dendrogram that would be produced if all data were gathered at a single site.
RACHET (Samatova, Ostrouchov, Geist & Melechko, 2002) runs a hierarchical clustering algorithm locally at each site. For each cluster in the hierarchy, it maintains a set of descriptive statistics that form a condensed summary of the data points in the cluster. The local dendrograms along with the descriptive statistics are transmitted to a merging site, which agglomerates them in order to construct the final global dendrogram. Experimental results show that RACHET achieves good clustering quality with low storage and communication costs.
Januzaj, Kriegel and Pfeifle (2004) present the Density Based Distributed Clustering (DBDC) algorithm, which first performs density-based clustering locally at each site. Then, a set of representative points that accurately describe each local cluster are selected. Finally, the representatives are transmitted to a central site, where they are clustered in order to produce the global model, which is sent back to the local sites.
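The representative-based pattern shared by these methods can be sketched as follows; KMeans stands in for the density-based algorithm of the actual method, and the data, cluster counts and library (scikit-learn) are placeholder assumptions.

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    sites = [rng.normal(loc=c, size=(100, 2)) for c in (0.0, 5.0, 10.0)]

    # Each site sends only its local cluster centers as representatives.
    representatives = np.vstack(
        [KMeans(n_clusters=3, n_init=10).fit(X).cluster_centers_ for X in sites]
    )
    # The merger site clusters the representatives into the global model.
    global_model = KMeans(n_clusters=3, n_init=10).fit(representatives)
    print(global_model.cluster_centers_)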
Database Clustering
Several distributed databases exhibit the data skewness property: the data distributions at different sites are not identical. For example, data related to a disease from hospitals around the world might have varying distributions due to different nutrition habits, climate and quality of life. The same is true, for example, for the buying patterns of customers at stores in different regions.
Skewness raises the question of which data should be mined together in order to discover useful knowledge. If all databases are considered as a single logical entity, then the idiosyncrasies of different sites will not be detected. On the other hand, if each database is mined separately, then knowledge that concerns more than one database might be lost. The solution that several researchers have followed is to cluster the databases themselves, identify groups of similar databases, and apply DDM methods within each group. One approach clusters the distributed databases based on the similarity of the association rules mined at each database. McClean, Scotney, Greer and Páircéir (2001)
consider the clustering of heterogeneous databases that hold aggregate count data.
They experimented with the Euclidean metric and the Kullback-Leibler information
divergence for measuring the distance of aggregate data. Tsoumakas, Angelis and Vlahavas (2004) consider the clustering of distributed databases for classification tasks. They cluster the classification models that are produced at each site based on the differences of their predictions on a validation data set. Experimental results show that combining the classifiers within each cluster leads to better performance than combining all classifiers from all databases.
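As an illustration of measuring the distance between databases that hold aggregate count data, the sketch below computes the Kullback-Leibler divergence between two count vectors, one of the measures used by McClean et al. (2001); the counts are toy placeholders and are assumed strictly positive.

    import numpy as np

    def kl_divergence(counts_p, counts_q):
        # Normalize the aggregate counts into probability distributions.
        p = np.asarray(counts_p, dtype=float); p /= p.sum()
        q = np.asarray(counts_q, dtype=float); q /= q.sum()
        # KL divergence; assumes strictly positive counts in both databases.
        return float(np.sum(p * np.log(p / q)))

    db1 = [40, 30, 30]   # aggregate counts of an attribute at database 1
    db2 = [35, 35, 30]   # the same attribute at database 2
    print(kl_divergence(db1, db2))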
FUTURE TRENDS

One trend that can be noticed during the last years is the implementation of DDM systems using emerging distributed computing paradigms, such as Web services and the grid. Another is the development of data mining applications and algorithms for P2P environments (Datta, Bhaduri, Giannella, Wolff & Kargupta, 2006). McConnell and Skillicorn (2005) present a distributed approach for prediction in sensor networks, while Davidson and Ravi (2005) present a distributed approach for data pre-processing in sensor networks.
CONCLUSION
DDM enables learning over huge volumes of data that are situated at different geographical locations. Its applications range from fraud and intrusion detection, to market basket analysis over a wide area, to knowledge discovery from distributed scientific data. As distributed databases and data streams proliferate, DDM algorithms and systems will continue to play an important role. New distributed applications will arise in the near future and DDM will be challenged to provide robust solutions for them.
REFERENCES

Agrawal, R. & Shafer, J.C. (1996). Parallel Mining of Association Rules. IEEE Transactions on Knowledge and Data Engineering, 8(6), 962-969.

Agrawal, R. & Srikant, R. (1994, September). Fast Algorithms for Mining Association Rules. Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94), Santiago de Chile, Chile, 487-499.
Chan, P. & Stolfo, S. (1993). Toward parallel and distributed learning by meta-learning. Working Notes of the AAAI Workshop on Knowledge Discovery in Databases, 227-240.
Cheung, D.W., Han, J., Ng, V., Fu, A.W. & Fu, Y. (1996, December). A Fast Distributed Algorithm for Mining Association Rules. Proceedings of the 4th International Conference on Parallel and Distributed Information Systems, Miami Beach, Florida, USA, 31-42.

Cheung, D.W., Ng, V., Fu, A.W. & Fu, Y. (1996). Efficient Mining of Association Rules in Distributed Databases. IEEE Transactions on Knowledge and Data Engineering, 8(6), 911-922.

Cho, V. & Wüthrich, B. (2002). Distributed Mining of Classification Rules. Knowledge and Information Systems, 4(1), 1-30.
Datta, S., Bhaduri, K., Giannella, C., Wolff, R. & Kargupta, H. (2006). Distributed Data Mining in Peer-to-Peer Networks. IEEE Internet Computing, 10(4), 18-26.
Davidson, I. & Ravi, A. (2005). Distributed Pre-Processing of Data on Networks of Berkeley Motes Using Non-Parametric EM. Proceedings of the 1st International Workshop on Data Mining in Sensor Networks, 17-27.
Fan, W., Stolfo, S. & Zhang, J. (1999, August). The Application of AdaBoost for Distributed, Scalable and On-line Learning. Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, California, USA, 362-366.

Fu, Y. (2001). Distributed Data Mining: An Overview. Newsletter of the IEEE Technical Committee on Distributed Processing, 5-9.

Guo, Y. & Sutiwaraphun, J. (1999). Probing Knowledge in Distributed Data Mining. Proceedings of the 3rd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-99), Beijing, China, 443-452.
Hall, L.O., Chawla, N., Bowyer, K. & Kegelmeyer, W.P. (2000). Learning Rules from Distributed Data. In M.J. Zaki & C.-T. Ho (Eds.), Large-Scale Parallel Data Mining, LNCS 1759, Springer, 211-220.
Hall, L.O., Chawla, N. & Bowyer, K. (1998, July). Decision Tree Learning on Very Large Data Sets. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics.
Hall, L.O., Chawla, N. & Bowyer, K. (1997). Combining Decision Trees Learned in Parallel. Working Notes of the KDD-97 Workshop on Distributed Data Mining.

Januzaj, E., Kriegel, H.-P. & Pfeifle, M. (2004). DBDC: Density Based Distributed Clustering. Proceedings of the 9th International Conference on Extending Database Technology (EDBT 2004), Heraklion, Greece, 88-105.

Johnson, E. & Kargupta, H. (1999). Collective, Hierarchical Clustering from Distributed, Heterogeneous Data. In M.J. Zaki & C.-T. Ho (Eds.), Large-Scale Parallel Data Mining, LNCS 1759, Springer, 221-244.

Kargupta, H., Park, B., Hershberger, D. & Johnson, E. (2000). Collective Data Mining: A New Perspective Toward Distributed Data Mining. In H. Kargupta & P. Chan (Eds.), Advances in Distributed and Parallel Knowledge Discovery, AAAI/MIT Press, 133-184.
Lazarevic, A. & Obradovic, Z. (2001). The Distributed Boosting Algorithm. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA, 311-316.
McClean, S., Scotney, B., Greer, K. & Páircéir, R. (2001). Conceptual Clustering of Heterogeneous Distributed Databases. Proceedings of the PKDD'01 Workshop on Ubiquitous Data Mining for Mobile and Distributed Environments.

McConnell, S. & Skillicorn, D.B. (2005). A Distributed Approach for Prediction in Sensor Networks. Proceedings of the 1st International Workshop on Data Mining in Sensor Networks, 28-37.
Park, B. & Kargupta, H. (2003). Distributed Data Mining: Algorithms, Systems, and Applications. In N. Ye (Ed.), The Handbook of Data Mining, Lawrence Erlbaum Associates, 341-358.
Samatova, N.F., Ostrouchov, G., Geist, A. & Melechko, A.V. (2002). RACHET: An Efficient Cover-Based Merging of Clustering Hierarchies from Distributed Datasets. Distributed and Parallel Databases, 11(2), 157-180.

Schapire, R.E. & Singer, Y. (1999). Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning, 37(3), 297-336.

Tsoumakas, G., Angelis, L. & Vlahavas, I. (2004). Clustering Classifiers for Knowledge Discovery from Physically Distributed Databases. Data & Knowledge Engineering, 49(3), 223-242.

Wolpert, D. (1992). Stacked Generalization. Neural Networks, 5(2), 241-259.

Wüthrich, B. (1997). Discovering Probabilistic Decision Rules. International Journal of Intelligent Systems in Accounting, Finance and Management, 6(4), 269-277.
KEY TERMS AND DEFINITIONS

Data Skewness: The observation that the probability distributions of the same attributes may differ significantly across the distributed databases.
Distributed Data Mining (DDM): A research area that is concerned with the development of efficient algorithms and systems for knowledge discovery in distributed computing environments.
Global Mining: The combination of the local models and/or sufficient statistics in
order to produce the global model that corresponds to all distributed data.
Grid: A network of computer systems that share resources in order to provide a high performance computing platform.
Local Mining: The application of data mining algorithms at the local data of each
distributed site.
Sensor Network: A network of spatially distributed devices that use sensors in order to monitor conditions (e.g., temperature, pressure, motion) at different locations.