AR Grid Computing
ABSTRACT
Grids are now regarded as promising platforms for data- and computation-intensive applications like data mining. However, the exploitation of such large-scale computing resources necessitates the development of new distributed algorithms. The major challenge facing the developers of distributed data mining algorithms is how to adjust the load imbalance that occurs during execution. This load imbalance is due to the dynamic nature of data mining algorithms (i.e. the load cannot be predicted before execution) and to the heterogeneity of Grid computing systems. In this paper, we propose a dynamic load balancing strategy for distributed association rule mining algorithms under a Grid computing environment. We evaluate the performance of the proposed strategy on Grid'5000, a Grid infrastructure distributed over nine sites around France for research in large-scale parallel and distributed systems.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications – Data mining.
I.6.4 [Simulation and Modeling]: Model Validation and Analysis.
D.1.3 [Software]: Concurrent Programming – Distributed programming.
F.1.2 [Theory of Computation]: Modes of Computation – Parallelism and concurrency.

General Terms
Algorithms, Design, Experimentation, Performance.

Keywords
Association rules, Dynamic load balancing, Grid computing, Parallel association rule mining, Work migration, Apriori algorithm.

1. INTRODUCTION
Powerful tools are needed to extract knowledge from the growing amount of data being collected and stored. The association rule mining technique, which finds interesting correlation relationships between items in voluminous transactional databases, has become one of the main data mining techniques [3].
Classic data mining techniques, based on sequential approaches, are often inadequate due to the complexity of the analysis that needs to be performed on huge amounts of data. Sequential approaches cannot provide scalability in terms of data dimensionality, size, and runtime performance. Moreover, the increasing trend towards decentralized business organizations, and the distribution of users, software, and hardware systems, magnifies the need for more advanced and flexible approaches and solutions.
Parallel association rule mining is designed for tightly-coupled systems, like shared or distributed memory machines and clusters based on fast networks. Distributed association rule mining deals with loosely-coupled systems: clusters with average-fast or slow networks and geographically distributed computing nodes. Grid association rule mining supports the distribution of data and allows the use of geographically dispersed computing resources (at a lower cost) in order to achieve performance not ordinarily attainable on a single computational resource.
Although Grid computing systems share many commonalities with parallel and distributed approaches, they have platform peculiarities and requirements that imply extra effort and new methodologies to deal with the heterogeneity of such systems. The majority of existing parallel and distributed ARM algorithms were designed for a homogeneous dedicated environment and thus use static load balancing strategies. When these algorithms are run on Grid systems, their performance degrades due to the load imbalance that appears between resources during execution time. This load imbalance is caused by the dynamic nature of the association rule mining algorithm and also by the heterogeneity of such computing systems. Because of that, we need to develop new load balancing schemes that allow a better exploitation of this advanced platform.
In this paper, we develop and evaluate a run-time load balancing strategy for association rule mining algorithms under a Grid computing environment. The goal of our strategy is to improve the performance of distributed algorithms and ameliorate their response time. The rest of the paper is organized as follows: Section 2 introduces the association rule mining technique. Section 3 describes the load balancing problem. Section 4 presents the
system model of a Grid. In Section 5, we propose the dynamic load balancing strategy. Experimental results obtained from implementing this strategy are shown in Section 6. Finally, the paper concludes with Section 7.

2. ASSOCIATION RULE MINING TECHNIQUE
Association rule mining (ARM) finds interesting correlation relationships among a large set of data items. This technique can be applied to a diversity of domains, and the knowledge gained can be used in applications ranging from business management, production control, and market analysis to engineering design and science exploration.
A typical example of this technique is market basket analysis. This process analyses customer buying habits by finding associations between the different items that customers place in their "shopping baskets". Such information may be used to plan marketing or advertising strategies, as well as catalog design [3]. Each basket represents a transaction in the transactional database, with the items bought by the customer associated to it. Given a transactional database D, an association rule has the form A=>B, where A and B are two itemsets and A∩B=∅. The rule's support is the joint probability of a transaction containing both A and B at the same time, and is given as σ(A∪B). The confidence of the rule is the conditional probability that a transaction contains B given that it contains A, and is given as σ(A∪B)/σ(A). A rule is frequent if its support is greater than or equal to a pre-determined minimum support, and strong if its confidence is greater than or equal to a user-specified minimum confidence [3].
Association rule mining is a two-step process:
1. The first step consists of finding all frequent itemsets, i.e. itemsets that occur at least as frequently as the fixed minimum support;
2. The second step consists of generating strong implication rules from these frequent itemsets.
The overall performance of mining association rules is determined by the first step, which is known as the frequent set counting problem [3].
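To make the support and confidence definitions concrete, the following small sketch computes both measures for a rule A => B over a toy transactional database; the baskets, item names and helper function are invented for the illustration and are not taken from the paper.

    # Toy illustration of support and confidence for a rule A => B.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter", "milk"},
        {"beer", "bread"},
        {"milk", "butter"},
    ]

    def support(itemset, transactions):
        """Fraction of transactions that contain every item of the itemset."""
        hits = sum(1 for t in transactions if itemset <= t)
        return hits / len(transactions)

    A, B = {"bread"}, {"milk"}
    sup_rule = support(A | B, baskets)          # sigma(A U B) = 0.5 here
    conf_rule = sup_rule / support(A, baskets)  # sigma(A U B) / sigma(A) ~ 0.67 here

    # The rule is frequent if sup_rule >= minsup and strong if conf_rule >= minconf.
    print(f"support={sup_rule:.2f}, confidence={conf_rule:.2f}")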
2.1 Sequential Association Rule Mining Algorithms
Many sequential algorithms for solving the frequent set counting problem have been proposed in the literature. We can distinguish two main methods for determining frequent itemset supports: with candidate itemset generation [11, 13] and without candidate itemset generation [5].
Apriori [11] was the first effective algorithm proposed. This algorithm uses a generate-and-test approach which depends on generating candidate itemsets and testing whether they are frequent. It uses an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets. During the initial pass over the database the support of all 1-itemsets is counted. Frequent 1-itemsets are used to generate all possible candidate 2-itemsets. Then the database is scanned again to obtain the number of occurrences of these candidates, and the frequent 2-itemsets are selected for the next iteration. This level-wise loop is sketched below.
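The following is a minimal sketch of this level-wise generate-and-test loop, assuming transactions are given as Python sets; candidate generation is done by a simple self-join of frequent (k-1)-itemsets and the classic subset-based prune step is omitted for brevity, so this is an illustration rather than a faithful implementation of the published algorithm.

    from itertools import combinations

    def apriori(transactions, minsup):
        """Minimal level-wise Apriori sketch: returns frequent itemsets with their supports."""
        n = len(transactions)
        # Pass 1: count 1-itemsets.
        counts = {}
        for t in transactions:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        frequent = {s: c / n for s, c in counts.items() if c / n >= minsup}
        result, k = dict(frequent), 2
        while frequent:
            # Generate candidate k-itemsets by joining frequent (k-1)-itemsets.
            prev = list(frequent)
            candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
            # Counting pass over the database.
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            frequent = {s: c / n for s, c in counts.items() if c / n >= minsup}
            result.update(frequent)
            k += 1
        return result

    # Example: frequent itemsets with minimum support 0.5 over the toy baskets above.
    # print(apriori(baskets, 0.5))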
The DCI algorithm proposed by Orlando et al. [13] is also based on candidate itemset generation. It adopts a hybrid approach to compute itemset supports, exploiting a counting-based method (with a horizontal database layout) during its first iterations and an intersection-based technique (with a vertical database layout) when the pruned dataset can fit into main memory.
The FP-growth algorithm [5] allows frequent itemset discovery without candidate itemset generation. It first builds from the transactional database a compact data structure called the FP-tree, then extracts frequent itemsets directly from the FP-tree.

2.2 Parallel Association Rule Mining Algorithms
Association rule mining algorithms suffer from a high computational complexity which derives from the size of their search space and the high demands of data access. Parallelism is expected to relieve these algorithms from the sequential bottleneck, providing the ability to scale to massive datasets and improving the response time. However, parallelizing these algorithms is not trivial and faces many challenges, including the workload balancing problem. Many parallel algorithms for solving the frequent set counting problem have been proposed. Most of them use the Apriori algorithm [11] as the fundamental algorithm, because of its success in the sequential setting. The reader can refer to the survey of Zaki on association rule mining algorithms and related parallelization schemas [7]. Agrawal et al. proposed a broad taxonomy of parallelization strategies that can be adopted for Apriori in [10].
There also exist many grid data mining projects, like Discovery Net, GridMiner and DMGA [9], which provide mechanisms for the integration and deployment of classical algorithms on the grid, as well as the DisDaMin project, which deals with data mining issues (association rules, clustering, etc.) using distributed computing [15].

3. THE NEED FOR LOAD BALANCING: PROBLEM DESCRIPTION
Workload balancing is the assignment of work to processors in a way that maximizes application performance [4]. A typical distributed system will have a number of processors working independently of each other. Each processor possesses an initial load, which represents an amount of work to be performed, and each may have a different processing capacity (i.e. different architecture, operating system, CPU speed, memory size and available disk space). To minimize the time needed to perform all tasks, the workload has to be evenly distributed over all processors in a way that minimizes both processor idle time and inter-processor communication.
The workload balancing process can be generalized into four basic steps:
1. Monitoring processor load and state;
2. Exchanging workload and state information between processors;
3. Decision making;
4. Data migration.
The decision phase is triggered when a load imbalance is detected, in order to calculate an optimal data redistribution. In the fourth and last phase, data migrates from overloaded processors to underloaded ones. According to the different policies
used in the previously mentioned phases, Casavant and Kuhl [14] classify workload balancing schemes into three major classes: (1) static versus dynamic load balancing; (2) centralized versus distributed load balancing; (3) application-level versus system-level load balancing.

3.1 Load Balancing in Parallel Association Rule Mining Algorithms
Static load balancing can be used in applications with constant workloads, as a pre-processor to the computation [4]. Other applications require dynamic load balancers that adjust the decomposition as the computation proceeds [4, 6]. This is due to their nature, which is characterized by workloads that are unpredictable and change during execution. Data mining is one of these applications.
Parallel association rule mining algorithms have a dynamic nature because of their dependency on the degree of correlation between itemsets in the transactional database, which cannot be predicted before execution. Basically, current algorithms assume the homogeneity and stability of the whole system, and new methodologies are needed to handle the previously mentioned issues.

3.2 Load Balancing in Grid Computing
Although intensive work has been done on load balancing, the different nature of a Grid computing environment compared to a traditional distributed system prevents existing static load balancing schemes from benefiting large-scale applications. An excellent survey by Y. Li et al. [17] presents the existing solutions and the new efforts in dynamic load balancing that aim to address the new challenges of the Grid.
The work done so far to cope with one or more challenges brought by the Grid (heterogeneity, resource sharing, high latency and dynamic system state) can be classified into three categories, as mentioned in [17]:
1. Repartition methods focus on calculating a new optimized workload distribution based on the current load distribution and system state. Though many variant repartition algorithms try to strike a balance between load balance and migration cost, they share the drawback of requiring prior knowledge of the workload and system state, and cannot adapt to system changes. Meanwhile, most algorithms assume homogeneous processors, which is often not the case in a Grid.
2. Divisible load theory based schemes arbitrarily partition the workload of the application and distribute it to multiple processors or machines in a linear way, giving a tractable methodology for modeling load distribution in both computation and communication. This method takes full consideration of the communication latency in the Grid and makes an optimized load distribution decision to achieve the goal of application-level load balancing. However, it suffers from an inability to follow dynamic changes of processor utilization rates and fluctuations of network channels, which limits this approach to the static load balancing area.
3. Prediction based schemes require accurate estimates of the future computation time and communication cost to establish the performance evaluation model. Experiments show that these schemes work well for short-lived tasks but need further investigation in the case of long-running applications.

4. GRID MODEL
In our study we model a Grid as a collection of T sites with different computational facilities and storage subsystems. Let G = (S1, S2, …, ST) denote a set of sites, where each site Si is defined as a vector with three parameters Si = (Mi, Coord(Si), Li): Mi is the total number of clusters in Si; Coord(Si) is the workload manager, named the coordinator of Si, which is responsible for detecting workload imbalance and for transferring the appropriate amount of work from an overloaded cluster to another, lightly loaded, cluster within the same site (intra-site) or, if necessary, to another remote site (inter-sites); and Li is the computational load of Si. This transfer takes into account the transmission speed between clusters, denoted ζijj' (if the transmission is from cluster clij to cluster clij'). If work migration should be performed between sites, then the destination site is fixed through communication between the coordinators of the different sites. This communication is done in a unidirectional ring topology via a token passing mechanism.
Each cluster is characterized by a vector of four parameters clij = (Nij, Coord(clij), Lij, ωij), where Nij is the total number of nodes in clij, Coord(clij) is the coordinator node of clij, which ensures a dynamic smart distribution of candidates to its own nodes, Lij is the computational load of cluster clij, and ωij is its processing time, which is the mean of the processing times of the cluster's nodes. In fact each node ndijk has its own processing time, denoted ωijk. Thus

ωij = Average(ωijk) = ( Σk ωijk ) / Nij      (1)

Figure 1 shows the Grid system model. To avoid keeping global state information in a large-scale system (where this information would be very large), the proposed load balancing model is distributed in both its intra-site and inter-site dimensions. Each site in the Grid has a workload manager, called the coordinator, which accommodates the submitted transactional database partitions and the list of candidates of the previous iteration of the association rule mining algorithm. Each coordinator aims at tracking the global workload status by periodically exchanging a "state vector" with the other coordinators in the system. Depending on the workload state of each node, the frequencies of candidate itemsets may be calculated on the local node or transferred to another, lightly loaded, node within the same site. If the coordinator cannot fix the workload imbalance locally, it selects part of the transactions to be sent to a remote site through the network. The destination of migrated work is chosen according to the following hierarchy: first, the coordinator of the cluster Coord(clij) selects an available node within the same cluster; if the workload imbalance still persists, then Coord(clij) searches for an available node in another cluster but within the same site; finally, in extreme cases, work will be sent to a remote site. The coordinator of the site Coord(Si) will look for the nearest available site to receive this workload (i.e. with the least communication cost).
Figure 1. The Grid system model: sites Si composed of clusters clij with their cluster coordinators Coord(clij), the database partitions (BD1, BD2, BD3) and the interconnecting network.
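The site/cluster/node vectors introduced above can be captured by a few small data structures. The following is a minimal sketch (class and field names are illustrative assumptions, not part of the paper) that also shows the per-cluster processing time ωij of equation (1) computed as the average of its nodes' processing times.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        name: str
        omega: float       # processing time omega_ijk of the node
        load: float = 0.0  # current computational load of the node

    @dataclass
    class Cluster:
        name: str
        nodes: List[Node] = field(default_factory=list)

        @property
        def load(self) -> float:   # L_ij: computational load of the cluster
            return sum(n.load for n in self.nodes)

        @property
        def omega(self) -> float:  # omega_ij = (sum_k omega_ijk) / N_ij, equation (1)
            return sum(n.omega for n in self.nodes) / len(self.nodes)

    @dataclass
    class Site:
        name: str
        clusters: List[Cluster] = field(default_factory=list)

        @property
        def load(self) -> float:   # L_i: computational load of the site
            return sum(c.load for c in self.clusters)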
Our model is fault-tolerant. In fact, it takes into consideration the probability of failure of a coordinator node. If the coordinator node does not respond within a fixed period of time, an election policy is invoked to choose another coordinator node.

5. PROPOSED DYNAMIC LOAD BALANCING STRATEGY
Dynamic load balancing is necessary for the efficient use of highly distributed systems (like Grids) and when solving problems with unpredictable load estimates (like association rule mining). That is why we chose to develop a dynamic workload balancing strategy.
Our proposed load balancing strategy depends on three issues:
1. Database architecture (partitioned or not);
2. Candidate set (duplicated or partitioned);
3. Network communication parameters (bandwidth).
The strategy can be adopted by algorithms which depend on candidate itemset generation to solve the frequent set counting problem. It combines static and dynamic load balancing by intervening both before execution (static) and during execution (dynamic).
Before execution: To respond to the heterogeneity of the computing system we are using (a Grid), the database is not simply partitioned into equal partitions in a random manner. Rather, the transactional database is partitioned depending on the characteristics of the different sites, where the size of each partition is determined according to the site's processing capacity (i.e., its architecture, operating system, CPU speed, etc.). It is the responsibility of the coordinator of the site Coord(Si) to allocate to its site the appropriate database portion according to the site processing capacity parameters stored in its information system.
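As an illustration of this pre-execution step, the sketch below splits a list of transactions into consecutive portions whose sizes are proportional to a per-site capacity score; the capacity scores, site names and helper name are assumptions made for the example, not values from the paper.

    def partition_by_capacity(transactions, capacities):
        """Split transactions into consecutive portions proportional to site capacities."""
        total = sum(capacities.values())
        portions, start = {}, 0
        sites = list(capacities)
        for i, site in enumerate(sites):
            # The last site takes the remainder so that nothing is lost to rounding.
            if i == len(sites) - 1:
                size = len(transactions) - start
            else:
                size = round(len(transactions) * capacities[site] / total)
            portions[site] = transactions[start:start + size]
            start += size
        return portions

    # Example with illustrative capacity scores (e.g. derived from CPU speed and memory):
    # portions = partition_by_capacity(all_transactions, {"site_A": 3.0, "site_B": 1.0})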
During execution: Our load balancing strategy acts on three levels:
1. Level one is the migration of work between nodes of the same cluster. If the skew in workload still persists, the coordinator of the cluster Coord(clij) moves to the next level;
2. Level two relies on the migration of work between clusters within the same site;
3. Finally, if the work migration of the previous two levels is not sufficient, the coordinator of the overloaded cluster Coord(clij) asks the coordinator of the site Coord(Si) to move to the third level, which searches for the possibility of migrating work between sites. Communication between the coordinators of the different sites is done in a unidirectional ring topology via a token passing mechanism.
The following workload balancing process is invoked when needed. It is the responsibility of the distributed coordinators to detect that need dynamically, according to the load status of their respective nodes, and it proceeds in the steps listed below (the site-level algorithm is sketched in Figure 2):
Figure 2. Workload balancing algorithm executed by the coordinator of each site Si (visible fragment):
Begin
  For All (cli,j in Si) do
    Update(Coordi, GLi(Li,j))        // update the global workload vector of the site
  End For
  ALi ← ( Σj=1..Mi Li,j ) / Mi       // calculate the average workload of the site Si
1. At the intra-site level, the coordinators of each cluster update their global workload vector by acquiring workload information from their local nodes. At the Grid level, the coordinators of the different sites periodically calculate their average workload in order to detect their workload state (overloaded or underloaded). If an imbalance is detected, the coordinators proceed to the following steps.
2. The coordinator of the overloaded cluster makes a plan for intra-site candidate migration (between nodes of the same site) using equation (2). If the imbalance still persists, it creates another plan for inter-site transaction migration (between clusters of the Grid) using equation (3).

EETi,j > Coefintra * ( CCNi,j,k + EETi,k )      (2)

EETi,j > Coefinter * ( CCSi,p + EETp,q )      (3)

where EETi,j is the estimated required processing time for node Ni,j of the site Si, EETi,k is the estimated required time to process the same operations on another node Ni,k of the same site Si in the case of equation (2), EETp,q is the estimated required processing time on another node Np,q of a remote site Sp in the case of equation (3), CCNi,j,k is the communication cost between nodes Ni,j and Ni,k of the site Si, CCSi,p is the communication cost between sites Si and Sp, Coefintra is the decision coefficient of the intra-site migration, and Coefinter is the decision coefficient of the inter-site migration.
The two equations above serve to ensure that, before performing any candidate itemset migration between nodes within the same site, or any transaction migration between different sites, the coordinator guarantees that the migration of transactions or candidates will improve the performance of the Grid. The processing time at the local node must dominate (by a prefixed threshold) the processing time at the remote node added to the time spent in
communication and transaction (or candidate) movements. Otherwise, it is better to process the transactions (or candidates) locally. The definition of this coefficient of domination (or threshold) depends on the execution environment and the size of the data to be processed. The coefficient of the intra-site migration is smaller than the coefficient of the inter-site migration, because the intra-site communication cost is much lower than the inter-site communication cost.
The information used to characterize the load of a node, a cluster or a site is its current load, represented by the number of instructions that need to be executed, and its speed, quantified in terms of the number of executed instructions per unit of time. In our case, all processing time estimates are calculated from the number of candidate itemsets generated at each iteration, whose frequencies the algorithm has to compute. The estimated required processing time EETi,j for a node Ni,j is calculated by multiplying the cardinality of the set of candidate itemsets generated at the beginning of a specific iteration by the node's processing time ωij. This is done dynamically at the beginning of each iteration of the association rule mining algorithm.
3. The concerned coordinator (the coordinator of the overloaded cluster or the coordinator of the overloaded site) sends the migration plan to all processing nodes and instructs them to reallocate the workload.
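A minimal sketch of these decision tests is given below: the EET estimate follows the description above (number of candidate itemsets times the node's processing time ω), and equations (2) and (3) are applied as simple predicates. The coefficient defaults, communication-cost arguments and numeric values in the example are illustrative assumptions only.

    def estimated_time(num_candidates: int, omega: float) -> float:
        """EET of a node for the current iteration: |candidates| * processing time per candidate."""
        return num_candidates * omega

    def migrate_intra_site(eet_local, eet_other, comm_cost_nodes, coef_intra=1.5):
        """Equation (2): migrate candidates to another node of the same site only if worthwhile."""
        return eet_local > coef_intra * (comm_cost_nodes + eet_other)

    def migrate_inter_site(eet_local, eet_remote, comm_cost_sites, coef_inter=3.0):
        """Equation (3): migrate transactions to a node of a remote site only if worthwhile."""
        return eet_local > coef_inter * (comm_cost_sites + eet_remote)

    # Example: a node with 120000 candidates and omega = 2e-3 s per candidate.
    # eet_local = estimated_time(120000, 2e-3)                        # 240 s
    # eet_other = estimated_time(120000, 5e-4)                        # 60 s on a faster node
    # migrate_intra_site(eet_local, eet_other, comm_cost_nodes=30.0)  # True: 240 > 1.5*(30+60)

Consistent with the text, the illustrative intra-site coefficient is chosen smaller than the inter-site one, reflecting the lower intra-site communication cost.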
number of accesses to the transactional database we used the
For each site Si, the coordinator will execute the algorithm depth-first Apriori proposed by W. Kosters et al. [16]. This
displayed in figure 2. Where Coord (Si) is the coordinator node of version of Apriori needs only three passes over the transactional
the site Si, Mi is the number of computational clusters, GLi is the database, while classic Apriori needs k-passes (where k is the
global vector of workloads of all nodes in the site Si, ALi is the length of the maximal itemset). Data parallelism is not sufficient
average workload of the site Si, LMaxi is the threshold of the to improve the performance of association rule mining algorithms.
maximum workload of the site Si, Lij is the local workload of the Subsets of extremely large data sets may also be very large. So, in
cluster clij of the site Si, Limiti,j is the threshold of the maximum order to extract the maximum of parallelism, we applied a hybrid
workload of the cluster cli,j in the site Si , CCNi,j,k is the parallelisation technique (i.e. the combination of data and task
communication cost between clusterd cli,j and cli,k of the site Si, parallelism). This could be done through searching inside the
EETi,j is the estimated required time for cluster cli,j of the site Si to algorithm procedures for independent segments and analyzing the
complete the processing of remaining transactions data, CCSi,p is loops to detect tasks (or instructions) that could be executed
the communication cost between sites Si and Sp, Vi is the state simultaneously.
vector of all the other coordinators in the Grid. The state of the A hybrid approach between candidate duplication and
coordinator of each site is stored in the vector with these candidate partitioning is used. The candidate itemsets are
information: Id-site, CCSi,p and Li. This vector is sorted by CCSi,p duplicated all over the sites of the Grid, but they are partitioned
and Li in order to construct a logical ring of communication between the nodes of each site. The reason for partitioning the
between sites. candidate itemsets is that when the minimum support threshold is
The state of imbalance of a cluster or a site is ascertained on low they overflow the memory space and incur a lot of disk I/O.
the basis of the current load index, where this index is the ratio So, the candidate itemsets are partitioned into equivalence classes
between the load and the speed of treatment. Practically this is based on their common (k-2) length prefixes. A detailed
done by dynamically defining (i.e. during each iteration of the explanation of candidate itemsets clustering could be found in [8].
association rule mining algorithm) the load balancing threasholds We can resume the important basic concepts of our parallelization
(Limiti,j for the cluster and LMaxi for the site). Where Limiti,j is method in what follows:
defined as the average initial load of cli,j added to it the standard
deviation of the workloads of the nodes of cli,j. This help in Site:
measuring the scope of load changes between a cluster and its
nodes. LMaxi takes into consideration the average initial load of • The transactional database is partitioned between sites
Si, added to it the standard deviation of the workloads of the according to the capacity of treatment of each site.
clusters of Si. • Candidate itemsets are duplicated between sites (in
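Putting these pieces together, the following is a minimal sketch of the site-level check suggested by the fragment of Figure 2 and the threshold definitions above: compute the average load ALi, derive the thresholds as mean plus standard deviation, and flag overloaded clusters. It assumes loads are available as plain numbers; the function name, the exact granularity of the comparisons (node loads against Limiti,j, cluster loads against LMaxi) and the example values are interpretations made for illustration.

    from statistics import mean, pstdev

    def site_balance_check(cluster_loads):
        """cluster_loads: dict mapping cluster id -> list of node loads of that cluster."""
        # L_ij: load of each cluster, taken here as the sum of its node loads.
        l_ij = {c: sum(nodes) for c, nodes in cluster_loads.items()}
        # AL_i: average workload of the site, as in the fragment of Figure 2.
        al_i = mean(l_ij.values())
        # Limit_ij: average load of the cluster's nodes plus their standard deviation.
        limit = {c: mean(nodes) + pstdev(nodes) for c, nodes in cluster_loads.items()}
        # LMax_i: average cluster load of the site plus the standard deviation of cluster loads.
        lmax_i = al_i + pstdev(list(l_ij.values()))
        # A node exceeding its cluster's Limit flags the cluster; a cluster exceeding LMax_i flags the site.
        overloaded_clusters = [c for c, nodes in cluster_loads.items()
                               if any(load > limit[c] for load in nodes)]
        site_overloaded = any(l > lmax_i for l in l_ij.values())
        return overloaded_clusters, site_overloaded

    # Example with illustrative load values:
    # print(site_balance_check({"cl_1": [10, 12, 40], "cl_2": [8, 9, 11]}))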
6. PERFORMANCE EVALUATION
We evaluated the performance of the load balancing strategy proposed in Section 5 using Grid'5000, the experimental platform dedicated to grid research [1]. We implemented a parallel version of the sequential Apriori [11] and added our workload balancing strategy to it. We conducted many experiments with different data sets, varying the support threshold and the number of nodes used for execution.

6.1 Case Study: The Apriori Algorithm
The specific characteristics of the frequent set counting problem, together with those of the computing environment (the Grid), must be taken into account. While the association rule mining method is based on global criteria (supports, frequencies), we only have local (partial) views of the data because of its distribution. The treatment must be done on the entire database: comparing each partition of the database with all the others must be possible in order to obtain global information.
Our goal is to limit the number of communications and synchronizations, and to benefit as much as possible from the available computing power. This can be done by exploiting all possible forms of parallelism and, if necessary, by using a pipeline approach between dependent tasks in order to parallelize the various stages of the frequent set counting algorithm.
In order to evaluate the performance of our workload balancing strategy, we parallelized the sequential Apriori algorithm, which is the fundamental algorithm for frequent set counting with candidate generation. To reduce the number of accesses to the transactional database we used the depth-first Apriori proposed by W. Kosters et al. [16]. This version of Apriori needs only three passes over the transactional database, while classic Apriori needs k passes (where k is the length of the maximal itemset). Data parallelism alone is not sufficient to improve the performance of association rule mining algorithms: subsets of extremely large data sets may also be very large. So, in order to extract the maximum of parallelism, we applied a hybrid parallelization technique (i.e. the combination of data and task parallelism). This is done by searching inside the algorithm procedures for independent segments and by analyzing the loops to detect tasks (or instructions) that can be executed simultaneously.
A hybrid approach between candidate duplication and candidate partitioning is used. The candidate itemsets are duplicated over all the sites of the Grid, but they are partitioned between the nodes of each site. The reason for partitioning the candidate itemsets is that when the minimum support threshold is low they overflow the memory space and incur a lot of disk I/O. So, the candidate itemsets are partitioned into equivalence classes based on their common (k-2)-length prefixes. A detailed explanation of candidate itemset clustering can be found in [8]. We can summarize the important basic concepts of our parallelization method as follows (a small sketch of the candidate partitioning and support reduction is given after the lists):

Site:
• The transactional database is partitioned between sites according to the treatment capacity of each site.
• Candidate itemsets are duplicated between sites (in order to reduce the communication cost between sites).
Cluster:
• Every database partition is shared between the nodes of the same site if they have the same storage subsystem; otherwise it is duplicated.
• Candidates are partitioned between the site's clusters (according to the treatment capacity of each cluster).

Node:
• Receives a group of candidates from the coordinator of the cluster.
• Calculates their supports.
• Sends the local supports to the cluster's coordinator, which performs the global support reduction.

Site's coordinator:
• Searches for the maximum loaded cluster (or site) and the minimum loaded cluster (or site).
• Migrates the necessary amount of work (candidates, transactions or both) from the maximum to the minimum loaded clusters or sites.
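The sketch below illustrates two of these concepts under simplifying assumptions (candidate k-itemsets as sorted tuples, a round-robin assignment of equivalence classes, invented helper names): candidates are grouped by their common (k-2)-length prefix and dealt out to the nodes of a site, and each node's local counts are then reduced into global supports by the cluster coordinator.

    from collections import defaultdict

    def partition_candidates(candidates, nodes):
        """Group candidate k-itemsets (sorted tuples) by their (k-2)-prefix and deal the
        resulting equivalence classes out to the nodes in round-robin fashion."""
        classes = defaultdict(list)
        for cand in candidates:
            classes[cand[:len(cand) - 2]].append(cand)
        assignment = {node: [] for node in nodes}
        for i, group in enumerate(classes.values()):
            assignment[nodes[i % len(nodes)]].extend(group)
        return assignment

    def count_local(candidates, partition):
        """Support counting on one node over its local database partition (list of sets)."""
        return {c: sum(1 for t in partition if set(c) <= t) for c in candidates}

    def reduce_counts(local_counts):
        """Cluster coordinator: sum the local counts received from its nodes."""
        totals = defaultdict(int)
        for counts in local_counts:
            for cand, n in counts.items():
                totals[cand] += n
        return dict(totals)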
Table 1. Transactional database characteristics

Database      # Items   Avg. Length   # Transactions   Database size
DB70T1M       4000      20            1,000,000        70 Mb
DB100T13M     4000      25            1,300,000        100 Mb

Figure 3 depicts the execution time obtained from running the parallel version of Apriori without the workload balancing strategy and the time obtained when the strategy is embedded in the parallel implementation. The database is initially partitioned over the different sites, where the size of the different portions depends on each site's capacity (CPU speed, memory size, available disk space, …). We can clearly see that the parallel execution time with the load balancing strategy is lower than the time obtained without it.

Figure 3. Execution time of parallel Apriori with and without load balancing for (a) DB70T1M and (b) DB100T13M, as a function of the minimum support (0.5%–3%).
Figure 4 shows the time needed for workload balancing (work migration and communication). It is clear that the computation time dominates the time needed for communication and work migration, which means that the overhead caused by the proposed workload balancing strategy can be considered negligible.
We also tried using small data sets (like mushroom and chess, with sizes in the order of kilobytes), and the results of these experiments show one important issue of data mining on grids. When the data set used is not big enough, the number of parallel nodes used should be decreased, or there will not be any improvement in execution time: the overhead of transferring data and results is too big compared with the computing time needed for smaller sets.

Figure 4. Run time, communication time and workload balancing time for dataset 1.

Figure 5 illustrates the speedup obtained as a function of different support values. We can clearly see that for both datasets we achieved a better speedup with the load balancing approach.
The drop in speedup for relatively higher support values is due to the fact that when the support threshold increases, the number of candidate itemsets generated decreases (i.e. there is less computation to be performed). In this case it would be better to decrease the number of nodes incorporated in the execution so that the communication cost does not become higher than the computation cost. In fact, there is no fixed optimal number of processors that could be used for execution. The number of processors used should be proportional to the size of the data sets to be mined. The easiest way to determine that optimal number is via experiments.
The first iteration of the association rule mining algorithm is a phase of initiation for workload balancing (i.e. creating the state vectors, processing time estimates, etc.). For the first dataset (DB70T1M) the algorithm performed 10 iterations in order to generate all possible frequent itemsets. Candidate itemset migration (intra-site) was initiated twice during the second iteration, and once during the third and fourth iterations.
For the second data set (DB100T13M), candidate itemset migration (intra-site) was launched twice during the second iteration, and transaction migration (inter-sites) was established during the third iteration. Another candidate itemset migration was needed during the fifth and seventh iterations. For this data set the algorithm iterated 14 times in order to generate all possible frequent itemsets.

Figure 5. Comparing the speedup of parallel Apriori with and without load balancing.

7. CONCLUSION
Data mining can be applied to a diversity of domains where there exist huge amounts of data that need to be analyzed in order to provide useful knowledge. The knowledge gained can be used in applications ranging from business management, production control, and market analysis, to engineering design and science exploration. An example of an application that provides ample
opportunities and challenges for data mining is the analysis of software problems in heterogeneous distributed systems. In such systems, logs collected from different nodes should be merged in order to create a view of what happened in the system. After that, an analysis is needed to detect the cause of the different problems. In this case we can apply both descriptive and predictive data mining tasks [3]. Clustering (a descriptive data mining technique) can be used to group the collected logs based on some specific criteria, like the type of operating system used. Then an association rule mining algorithm can be applied on the obtained classes to detect the kind of behavior related to each operating system.
Regardless of the domain of application of data mining, the amount of data to be analyzed is very large. Serial algorithms lose their efficiency and are not able to provide knowledge quickly. Parallel and distributed systems could be the natural solution to this problem. But implementing such parallel and distributed algorithms is not trivial and faces many challenges, including load balancing.
Data mining algorithms have a dynamic nature during execution time which causes load imbalance between the different processing nodes. Such algorithms require dynamic load balancers that adjust the decomposition as the computation proceeds. Numerous static load balancing strategies have been developed, whereas dynamic load balancing is still an open and challenging research area. In this article we developed a dynamic load balancing strategy for association rule mining algorithms (with candidate itemset generation) under a Grid computing environment. Our load balancing strategy acts on three levels. In the first level, the load is adjusted via the migration of work between nodes of the same cluster. If the skew in workload still persists, candidate itemset migration is established between clusters within the same site. And finally, if needed, the coordinator moves to the third level and searches for the possibility of migrating work between sites.
The load balancing strategy derives the several necessary coefficients and time estimates at run time. This is done during each iteration of the association rule mining algorithm in order to respond to the dynamicity of the system. Experiments showed that our strategy succeeded in achieving a better use of the Grid architecture by ensuring load balancing, and this for large-sized datasets. In the future, we plan to study the impact of the database type (dense and sparse) on our strategy.

8. REFERENCES
[1] F. Cappello, E. Caron, M. Dayde, F. Desprez, Y. Jegou, P. Vicat-Blanc Primet, E. Jeannot, S. Lanteri, J. Leduc, N. Melab, G. Mornet, B. Quetier and O. Richard. Grid'5000: a large scale and highly reconfigurable grid experimental testbed. In SC'05: Proc. of the 6th IEEE/ACM International Workshop on Grid Computing, Seattle, Washington, USA, November 2005, IEEE/ACM, pp. 99–106, 2005.
[2] I. Foster and C. Kesselman. The Grid 2: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 2003.
[3] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.
[4] K. Devine, E. Boman, R. Heaphy and B. Hendrickson. "New Challenges in Dynamic Load Balancing". Applied Numerical Mathematics, Vol. 52, issues 2-3, pp. 133-152, 2005.
[5] K. Wang, L. Tang, J. Han and J. Liu. "Top Down FP-Growth for Association Rule Mining". In Proc. of the 6th Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining, Taipei, pp. 334-370, 2002.
[6] M. H. Willebeek-LeMair and A. P. Reeves. "Strategies for Dynamic Load Balancing on Highly Parallel Computers". IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 9, pp. 979-993, September 1993.
[7] M. J. Zaki. "Parallel and Distributed Association Mining: a Survey". IEEE Concurrency, 7(4), pp. 14-25, 1999.
[8] M. J. Zaki, S. Parthasarathy, M. Ogihara and W. Li. "New Algorithms for Fast Discovery of Association Rules". University of Rochester, Technical Report 651, July 1997.
[9] M. S. Perez, A. Sanchez, V. Robles, P. Herrero and J. Pena. "Design and Implementation of a Data Mining Grid-aware Architecture". Future Generation Computer Systems, 23(1), pp. 42–47, 2007.
[10] R. Agrawal and J. C. Shafer. "Parallel Mining of Association Rules". IEEE Transactions on Knowledge and Data Engineering, 8:962-969, 1996.
[11] R. Agrawal and R. Srikant. "Fast Algorithms for Mining Association Rules in Large Databases". In Proc. of the Int'l Conf. on Very Large Data Bases (VLDB'94), pp. 478-499, 1994.
[12] Generator of Databases Site: http://www.almaden.ibm.com/cs/quest.
[13] S. Orlando, P. Palmerini and R. Perego. "A Scalable Multi-Strategy Algorithm for Counting Frequent Sets". In Proc. of the 4th International Conference on Knowledge Discovery and Data Mining (KDD), New York, USA, 2002.
[14] T. L. Casavant and J. G. Kuhl. "A Taxonomy of Scheduling in General-Purpose Distributed Computing Systems". IEEE Transactions on Software Engineering, 14(2): 141, February 1988.
[15] V. Fiolet and B. Toursel. "Distributed Data Mining". Scalable Computing: Practice and Experience, 6(1), pp. 99–109, 2005.
[16] W. Kosters and W. Pijls. "Apriori, A Depth First Implementation". In Proc. of the FIMI Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, USA, 2003.
[17] Y. Li and Z. Lan. "A Survey of Load Balancing in Grid Computing". In Computational and Information Science, First International Symposium, CIS 2004, Shanghai, China, 2004.