2010 - An Optimized Distributed Association Rule Mining Algorithm in Parallel and Distributed Data Mining With XML Data For Improved Response Time
ABSTRACT
Many current data mining tasks can be accomplished successfully only in a distributed setting. The field of
distributed data mining has therefore gained increasing importance in the last decade. The Apriori algorithm by
Rakesh Agrawal has emerged as one of the best association rule mining algorithms and also serves as the base
algorithm for most parallel algorithms. The enormity and high dimensionality of the datasets typically available as
input to association rule discovery make it an ideal problem for solving on multiple processors in parallel,
primarily because of the memory and CPU speed limitations faced by single processors. In this paper
an Optimized Distributed Association Rule Mining algorithm is applied to geographically distributed data in a
parallel and distributed environment so as to reduce communication costs. The response time in this
environment is measured using XML data.
KEYWORDS
Association rules, Apriori algorithm, parallel and distributed data mining, XML data, response time.
1. INTRODUCTION
Association rule mining (ARM) has become one of the core data mining tasks and has attracted
tremendous interest among data mining researchers. ARM is an undirected or unsupervised data mining
technique which works on variable-length data and produces clear and understandable results. Two
dominant approaches for utilizing multiple processors have emerged: distributed memory, in
which each processor has a private memory; and shared memory, in which all processors access a common
memory [4]. Shared memory architecture has many desirable properties. Each processor has direct and
equal access to all memory in the system. Parallel programs are easy to implement on such a system. In
distributed memory architecture each processor has its own local memory that can only be accessed
directly by that processor [10]. For a processor to have access to data in the local memory of another
processor a copy of the desired data element must be sent from one processor to the other through
message passing. XML data are used with the Optimized Distributed Association Rule Mining Algorithm.
A parallel application can be divided into a number of tasks and executed concurrently on different
processors in the system [9]. However, the performance of a parallel application on a distributed system is
mainly dependent on the allocation of the tasks comprising the application onto the available processors
in the system.
Modern organizations are geographically distributed. Typically, each site locally stores its ever-increasing
amount of day-to-day data. Using centralized data mining to discover useful patterns in such
organizations' data isn't always feasible because merging data sets from different sites into a centralized
site incurs huge network communication costs. Data from these organizations are not only distributed
over various locations but also vertically fragmented, making it difficult if not impossible to combine
them in a central location. Distributed data mining has thus emerged as an active subarea of data mining
research. In this paper an Optimized Association Rule Mining Algorithm is used for performing the
mining process.
2. RELATED WORK
Three parallel algorithms for mining association rules, an important data mining problem, are formulated
in [2]. These algorithms have been designed to investigate and understand the
performance implications of a spectrum of trade-offs between computation, communication, memory
usage, synchronization, and the use of problem-specific information in parallel data mining [11]. Fast
Distributed Mining of association rules [3] generates a small number of candidate sets and
substantially reduces the number of messages to be passed while mining association rules.
Algorithms for mining association rules from relational data have been well developed. Several query
languages have been proposed to assist association rule mining, such as [18], [19]. The topic of mining
XML data has received little attention, as the data mining community has focused on the development of
techniques for extracting common structure from heterogeneous XML data. For instance, [20] has
proposed an algorithm to construct a frequent tree by finding common subtrees embedded in the
heterogeneous XML data. On the other hand, some researchers focus on developing a standard model to
represent the knowledge extracted from the data using XML. JAM [21] has been developed to gather
information from sparse data sources and induce a global classification model. The PADMA system [22]
is a document analysis tool working on a distributed environment, based on cooperative agents. It works
without any relational database underneath. Instead, there are PADMA agents that perform several
relational operations with the information extracted from the documents.
An association rule mining algorithm, Apriori has been developed for rule mining in large transaction
databases by IBM's Quest project team [3]. An itemset is a non-empty set of items.
They decomposed the problem of mining association rules into two parts:
• Find all combinations of items that have transaction support above minimum support. Call those
combinations frequent itemsets.
• Use the frequent itemsets to generate the desired rules. The general idea is that if, say, ABCD and
AB are frequent itemsets, then we can determine whether the rule AB → CD holds by computing the ratio r
= support(ABCD)/support(AB). The rule holds only if r >= minimum confidence (a small check of
this ratio is sketched after this list). Note that the
rule will have minimum support because ABCD is frequent. The algorithm is highly scalable [7].
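As a quick illustration of the confidence check above, the following Python sketch uses hypothetical support counts; the values and the threshold are made up, and only the ratio test itself follows the description given in the text.

# Hypothetical support counts out of 100 transactions; min_conf is an assumed threshold.
support = {frozenset("ABCD"): 30, frozenset("AB"): 50}
min_conf = 0.5

# Confidence of the rule AB -> CD: support(ABCD) / support(AB).
r = support[frozenset("ABCD")] / support[frozenset("AB")]
if r >= min_conf:
    print(f"Rule AB -> CD holds (confidence {r:.2f})")
else:
    print(f"Rule AB -> CD rejected (confidence {r:.2f} < {min_conf})")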
The Apriori algorithm used in Quest for finding all frequent itemsets is given below.
procedure AprioriAlg()
begin
    L1 := {frequent 1-itemsets};
    for ( k := 2; Lk-1 ≠ ∅; k++ ) do {
        Ck := apriori-gen(Lk-1);   // new candidates
        for all transactions t in the dataset do {
            for all candidates c ∈ Ck contained in t do
                c.count++;
        }
        Lk := { c ∈ Ck | c.count >= min-support };
    }
    Answer := ∪k Lk;
end
It makes multiple passes over the database. In the first pass, the algorithm simply counts item occurrences
to determine the frequent 1-itemsets (itemsets with 1 item). A subsequent pass, say pass k, consists of two
phases. First, the frequent itemsets Lk-1 (the set of all frequent (k-1)-itemsets) found in the (k-1)th pass are
used to generate the candidate itemsets Ck, using the apriori-gen() function. This function first joins Lk-1
with Lk-1, the joining condition being that the lexicographically ordered first k-2 items are the same. Next,
it deletes all those itemsets from the join result that have some (k-1)-subset that is not in Lk-1 yielding Ck.
The algorithm now scans the database. For each transaction, it determines which of the candidates in Ck
are contained in the transaction using a hash-tree data structure and increments the count of those
candidates [8], [14]. At the end of the pass, Ck is examined to determine which of the candidates are
frequent, yielding Lk. The algorithm terminates when Lk becomes empty.
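For concreteness, the following is a minimal Python sketch of these passes. The join step simply unions pairs of frequent (k-1)-itemsets instead of using the lexicographic join, which yields the same candidate set after the subset-based pruning; the transaction list and support threshold at the end are made up for illustration.

from itertools import combinations

def apriori(transactions, min_support):
    # Pass 1: count individual items to obtain the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s for s, c in counts.items() if c >= min_support}
    answer = set(frequent)
    k = 2
    while frequent:
        # apriori-gen: join L(k-1) with itself, then prune candidates that have
        # a (k-1)-subset which is not frequent.
        candidates = set()
        for a in frequent:
            for b in frequent:
                union = a | b
                if len(union) == k and all(frozenset(sub) in frequent
                                           for sub in combinations(union, k - 1)):
                    candidates.add(frozenset(union))
        # Scan the transactions and count the candidates contained in each one.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c for c, n in counts.items() if n >= min_support}
        answer |= frequent
        k += 1
    return answer

# Example with a small made-up transaction list and an absolute support threshold of 2.
data = [{"A", "B", "C"}, {"A", "C"}, {"A", "B"}, {"B", "C"}]
print(apriori(data, min_support=2))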
The major cost of mining association rules is the computation of the set of large itemsets in the database.
Distributed computing of large itemsets encounters some new problems. One may compute locally large
itemsets easily, but a locally large itemset may not be globally large. Since it is very expensive to
broadcast the whole data set to other sites, one option is to broadcast the counts of all itemsets, whether
locally large or small, to the other sites. However, a database may contain an enormous number of itemset
combinations, and this would involve passing a huge number of messages.
A distributed data mining algorithm, FDM (Fast Distributed Mining of association rules), has been
proposed in [5]; it has the following distinct features.
1. The generation of candidate sets is in the same spirit as Apriori. However, some relationships
between locally large sets and globally large ones are exploited to generate a smaller set of
candidate sets at each iteration and thus reduce the number of messages to be passed.
2. After the candidate sets have been generated, two pruning techniques, local pruning and global
pruning, are developed to prune away some candidate sets at each individual site.
3. In order to determine whether a candidate set is large, this algorithm requires only O(n) messages
for support count exchange, where n is the number of sites in the network. This is much less than
a straightforward adaptation of Apriori, which requires O(n²) messages [3].
Distributed data mining refers to the mining of distributed data sets. The data sets are stored in
local databases hosted by local computers which are connected through a computer network [15], [16].
Data mining takes place at a local level and at a global level where local data mining results are combined
to gain global findings. Distributed data mining is often mentioned together with parallel data mining in the literature.
While both attempt to improve the performance of traditional data mining systems, they assume different
system architectures and take different approaches. In distributed data mining, computers are distributed
and communicate through message passing. In parallel data mining, a parallel computer is assumed, with
processors sharing memory and/or disk. Computers in a distributed data mining system may be viewed as
processors sharing nothing. This difference in architecture greatly influences algorithm design, cost
model, and performance measure in distributed and parallel data mining.
The following parallel formulations of association rule mining are commonly considered:
1. Count Distribution – parallelizing the task of measuring the frequency of a pattern inside a
database (a minimal sketch of this approach follows the list)
2. Candidate Distribution – parallelizing the task of generating longer patterns
3. Hybrid Count and Candidate Distribution – a hybrid algorithm that tries to combine the strengths
of the above algorithms
4. Sampling with Hybrid Count and Candidate Distribution – an algorithm that tries to use only a
sample of the database.
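The first of these formulations, count distribution, can be illustrated with a minimal sequential simulation in Python. The partitions and the candidate size below are hypothetical; in a real system each partition would be counted by its own processor, and the final sum would be a message exchange rather than a loop.

from collections import Counter
from itertools import combinations

def local_counts(partition, k):
    # Each processor counts k-itemset occurrences in its own partition only.
    counts = Counter()
    for transaction in partition:
        for itemset in combinations(sorted(transaction), k):
            counts[itemset] += 1
    return counts

# Hypothetical transaction database split into three partitions (one per processor).
partitions = [
    [{"A", "B", "C"}, {"A", "C"}],
    [{"B", "C"}, {"A", "B", "C"}],
    [{"A", "B"}, {"A", "C"}],
]

# Count distribution: independent local counting, then a global sum-reduction.
global_counts = Counter()
for part in partitions:  # conceptually, each iteration runs on a separate processor
    global_counts += local_counts(part, k=2)

print(global_counts.most_common(3))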
The speedup and efficiency of parallel formulations are discussed below. In parallel data mining the
main issues taken into account are:
• Load balancing
• Minimizing communication
• Overlapping communication and computation
Speed Up
Let the serial fraction of the code be f_s and the parallel fraction be f_p = 1 − f_s. With p processors,
Amdahl's law gives the speedup

    S = T_s / T_p = 1 / (f_s + f_p / p)
The serial fraction places a limit on the scalability of parallelism. For example, at f_p = 0.50, 0.90
and 0.99 (that is, 50%, 90% and 99% of the code is parallelizable), the speedup can never exceed 2, 10 and
100 respectively, no matter how many processors are used. However, certain problems demonstrate
increased performance as the problem size increases. Problems which increase the percentage of parallel
time with their size are more "scalable" than problems with a fixed percentage of parallel time.
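A quick numerical check of this bound in Python (the processor count of 16 is arbitrary):

def speedup(fp, p):
    # Amdahl's law: S = 1 / (fs + fp / p), with fs = 1 - fp.
    fs = 1.0 - fp
    return 1.0 / (fs + fp / p)

for fp in (0.50, 0.90, 0.99):
    limit = 1.0 / (1.0 - fp)  # speedup as p grows without bound
    print(f"fp = {fp:.2f}: p = 16 gives {speedup(fp, 16):.1f}x, limit {limit:.0f}x")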
Efficiency
Parallel efficiency is reduced by overheads such as:
• communication,
• extra computation,
• idle periods due to sequential components,
• idle periods due to load imbalance.
Eliminating infrequent items from every transaction reduces the data set size significantly, so we can
accumulate more transactions in the main memory. The number of items in the data set might be large, but
only a few will satisfy the support threshold.
Consider the sample data set in Figure 1a. If we load the data set into the main memory as it is, then we
only find one pair of identical transactions (ABCD), as Figure 1b shows. However, if we load the data set
into the main memory after eliminating infrequent items from every transaction (that is, items that do not
reach 50 percent support and thus do not occur in at least every second transaction; in this case, item E),
we find more identical transactions (see Figure 1c). This technique not only reduces the average
transaction size but also finds more identical transactions. The following gives the pseudocode of the
ODAM algorithm [17].
// Pass 2: count candidate 2-itemsets and reduce each transaction
for each transaction t in the partition do {
    for each 2-itemset s ⊆ t do
        if (s ∈ C2) s.sup++;
    t' = delete_nonfrequent_items(t);
    Table.add(t');
}
send_to_receiver(C2);              // send local support counts to the receiver site
F2 = receive_from_receiver();      // globally frequent 2-itemsets
// Passes k >= 3: work on the reduced transactions kept in main memory
C3 = {candidate itemsets}; T = Table.getTransactions(); k = 3;
while (Ck ≠ ∅) do {
    for each transaction t in T do
        for each k-itemset s ⊆ t do
            if (s ∈ Ck) s.sup++;
    send_to_receiver(Ck);
    Fk = receive_from_receiver();
    Ck+1 = {candidate itemsets}; k++;
}
Figure 1. Sample data set: (a) the original transactions; (b) the transactions loaded into main memory, with
identical transactions merged; (c) the transactions after eliminating the infrequent item E.

(a) Original data set
No.     Items
1       ABCD
2       BC
3       AB
4       ABCDE
5       ACD
6       ABE
7       CDE
8       AD
9       BCE
10      ABCD

(b) Loaded into main memory
No.     Items
1,10    ABCD
2       BC
3       AB
4       ABCDE
5       ACD
6       ABE
7       CDE
8       AD
9       BCE

(c) After eliminating infrequent items
No.     Items
1,4,10  ABCD
2,9     BC
3,6     AB
5       ACD
7       CD
8       AD
ODAM eliminates all globally infrequent 1-itemsets from every transaction and inserts the reduced
transactions into the main memory; this reduces the transaction size (the number of items) and yields more
identical transactions, because the data set initially contains both frequent and infrequent items. However,
the total number of transactions could still exceed the main memory limit.
ODAM removes infrequent items and inserts each reduced transaction into the main memory. While
inserting the transactions, it checks whether they are already in memory. If so, it increases that
transaction's counter by one. Otherwise, it inserts the transaction into the main memory with a count equal
to one. Finally, it writes all main-memory entries for this partition into a temp file. This process continues
for all other partitions.
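The reduction-and-merge step just described can be sketched in a few lines of Python. The sketch below is run on the Figure 1 data set, with the globally frequent items assumed to be A, B, C and D (item E being infrequent at 50 percent support); the write to a per-partition temp file is only indicated by a comment.

def reduce_partition(transactions, frequent_items):
    # Drop globally infrequent items, then merge identical reduced transactions
    # in main memory, keeping a counter per distinct reduced transaction.
    table = {}
    for t in transactions:
        reduced = frozenset(item for item in t if item in frequent_items)
        if reduced:
            table[reduced] = table.get(reduced, 0) + 1  # already present: bump counter
    return table

# Figure 1 data set (transactions 1-10).
data = [set("ABCD"), set("BC"), set("AB"), set("ABCDE"), set("ACD"),
        set("ABE"), set("CDE"), set("AD"), set("BCE"), set("ABCD")]
table = reduce_partition(data, frequent_items={"A", "B", "C", "D"})
for itemset, count in sorted(table.items(), key=lambda e: -e[1]):
    print("".join(sorted(itemset)), count)
# In ODAM, these main-memory entries would now be written to a temp file for this
# partition, and the process would continue with the next partition.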
Hardware Platform
In a shared memory architecture, all processors access a common memory. Each processor has direct
and equal access to all the memory in the system, and parallel programs are relatively easy to implement
on such a system.
Data Parallelism
The data warehouse is partitioned among P processors (logically partitioned in the case of an SMP). Each
processor works on its local partition of the database but performs the same computation of counting support.
Load Balancing
Dynamic load balancing addresses uneven workloads by reassigning work from heavily loaded processors
to more lightly loaded ones. The development of distributed rule mining is a challenging and critical task,
since it requires knowledge of all the data stored at different locations and the ability to combine partial
results from individual databases into a single result.
Association rule mining from XML data is considered with reference to the sample XML document
depicted in Figure 2, which represents the items purchased in each transaction. The set of transactions is
identified by the tag <transactions> and each transaction in the set is identified by the tag <transaction>.
The set of items in each transaction is identified by the tag <items> and an item is identified by the tag
<item>. Consider the problem of mining all association rules among items that appear in the transactions
document shown in Figure 2. Following traditional association rule mining, we expect to obtain a
large-itemsets document and an association-rules document from the source document.
<transactions>
  <transaction id="1">
    <items>
      <item>i1</item>
      <item>i4</item>
      <item>i7</item>
    </items>
  </transaction>
  <transaction id="2">
    <items>
      <item>i2</item>
      <item>i3</item>
      <item>i5</item>
    </items>
  </transaction>
  <transaction id="3">
    <items>
      <item>i1</item>
      <item>i3</item>
      <item>i7</item>
    </items>
  </transaction>
  <transaction id="4">
    <items>
      <item>i2</item>
      <item>i5</item>
    </items>
  </transaction>
  <transaction id="5">
    <items>
      <item>i1</item>
      <item>i5</item>
    </items>
  </transaction>
</transactions>

Figure 2: Transaction document (transactions.xml)
construct set S of all unique record types, with elements as sets of size 1
L[1] <- S
i <- 2
while L[i-1] is not empty do
    L[i] <- collection of all unique record types, with elements as sets of size i, constructed from L[i-1]
    increment i
output O, the collection of all levels L[i]
The basic idea is as follows. First, the record ids are assigned per record type. A basketSet is constructed
for each type of record encountered. An empty recordTypeList and an empty RIDList are created to start.
These lists are parallel, in that the recordType at position n of the recordTypeList is associated with the
RID at position n of the RIDList. A single path from root to leaf is considered at a time. As the algorithm
progresses along this path, it examines each node. If the node is not a leaf, the algorithm looks at the node
type (recordType) and asks the basketSet associated with this recordType for a new RID. It then adds the
recordType to the end of the recordTypeList and the RID to the end of the RIDList. If the node is a leaf
(consider a leaf to be of the form <purchase>pen</purchase>), the algorithm loops through the RIDList
and recordTypeList to build the baskets.
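For the single-record-type document of Figure 2, the basket construction collapses to grouping each transaction's items under its id. The following Python sketch, using the standard ElementTree parser, shows this simplified case; the file name is the one given in the Figure 2 caption.

import xml.etree.ElementTree as ET

def baskets_from_transactions(xml_path):
    # Each <transaction> element becomes one basket of its <item> values,
    # keyed by the transaction id attribute.
    tree = ET.parse(xml_path)
    baskets = {}
    for transaction in tree.getroot().iter("transaction"):
        rid = transaction.get("id")
        items = {item.text.strip() for item in transaction.iter("item")}
        baskets[rid] = items
    return baskets

# Example: baskets_from_transactions("transactions.xml")
# -> {'1': {'i1', 'i4', 'i7'}, '2': {'i2', 'i3', 'i5'}, '3': {'i1', 'i3', 'i7'},
#     '4': {'i2', 'i5'}, '5': {'i1', 'i5'}}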
7. PERFORMANCE EVALUATION
To evaluate the number of messages that ODAM exchanges among various sites to generate the globally
frequent itemsets in a distributed environment, we partition the original data set into five partitions. To
reduce the dependency among the partitions, each one contains only 20 percent of the original data set's
transactions, so the number of identical transactions among different partitions is very low. ODAM
provides an efficient method for generating association rules from different datasets, distributed among
various sites.
Figure 3: Total message size that ODAM transmits to generate the globally frequent itemsets
The datasets are generated randomly depending on the number of distinct items, the maximum number of
items in each transaction and the number of transactions. The performance of the XQuery
implementation is dependent on the number of large itemsets found and the size of the dataset.
The running time for dataset-1 with minimum support 20% is much higher than the running times for
dataset-2 and dataset-3, since the number of large itemsets found for dataset-1 is about two times larger
than for the other datasets.
The response time of the parallel and distributed data mining task on XML data is determined by the time
taken for communication and the computation cost involved. Communication time is largely dependent on the
DDM operational model and the architecture of the DDM system. The computation time is the time taken
to perform the mining process on the distributed data sets.
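As a rough decomposition (a simple cost model rather than a measured formula), the response time can be expressed as

    T_response ≈ T_comm + T_comp

where T_comm is the time spent exchanging support counts between sites and T_comp is the time spent on local candidate generation and counting.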
A perfect scale-up is observed when the different types of datasets are considered, and an improved
response time is achieved for the XML data used.
8. CONCLUSIONS
Association rule mining is an important problem in data mining. Performing association rule mining on
XML data is a new and challenging area due to the complexity of XML data. In our approach, multiple
nesting problems in XML data are handled appropriately to assure the correctness of the result.
The Optimized Distributed Association Rule Mining algorithm is used for the mining process in a parallel
and distributed environment. The communication and computation factors are taken into account to
achieve an improved response time. The performance analysis is done by increasing the number of
processors in a distributed environment. As the mining process is done in parallel, an optimal solution is
obtained. A future enhancement of this work is to cluster the same XML dataset and examine the
knowledge extracted from it. A visual analysis can also be made of the same.
REFERENCES
[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l
Conf. Very Large Databases (VLDB 94), Morgan Kaufmann, 1994, pp. 407-419.
[4] A. Savasere, E. Omiecinski, and S.B. Navathe, "An Efficient Algorithm for Mining Association Rules in Large
Databases," Proc. 21st Int'l Conf. Very Large Databases (VLDB 95), Morgan Kaufmann, 1995, pp. 432-444.
[5] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. ACM SIGMOD
Int'l Conf. Management of Data, ACM Press, 2000, pp. 1-12.
[6] M.J. Zaki and Y. Pan, "Introduction: Recent Developments in Parallel and Distributed Data Mining," J.
Distributed and Parallel Databases, vol. 11, no. 2, 2002, pp. 123-127.
[7] M.J. Zaki, "Scalable Algorithms for Association Mining," IEEE Trans. Knowledge and Data Eng., vol. 12,
no. 2, 2000, pp. 372-390.
[8] J.S. Park, M. Chen, and P.S. Yu, "An Effective Hash Based Algorithm for Mining Association Rules," Proc.
1995 ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 1995, pp. 175-186.
[9] M.J. Zaki et al., Parallel Data Mining for Association Rules on Shared-Memory Multiprocessors, tech. report
TR 618, Computer Science Dept., Univ. of Rochester, 1996.
[10] D.W. Cheung et al., "Efficient Mining of Association Rules in Distributed Databases," IEEE Trans.
Knowledge and Data Eng., vol. 8, no. 6, 1996, pp. 911-922.
[11] A. Schuster and R. Wolff, "Communication-Efficient Distributed Mining of Association Rules," Proc. ACM
SIGMOD Int'l Conf. Management of Data, ACM Press, 2001, pp. 473-484.
[12] M.J. Zaki, "Parallel and Distributed Association Mining: A Survey," IEEE Concurrency, Oct.-Dec. 1999,
pp. 14-25.
[13] C.L. Blake and C.J. Merz, UCI Repository of Machine Learning Databases, Dept. of Information and
Computer Science, University of California, Irvine, 1998.
[14] T. Shintani and M. Kitsuregawa, "Hash-Based Parallel Algorithms for Mining Association Rules," Proc. Conf.
Parallel and Distributed Information Systems, IEEE CS Press, 1996, pp. 19-30.
[15] C.C. Aggarwal and P.S. Yu, "A New Approach to Online Generation of Association Rules," IEEE Trans.
Knowledge and Data Eng., vol. 13, no. 4, 2001, pp. 527-540.
[16] G.I. Webb, "Efficient Search for Association Rules," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge
Discovery and Data Mining (KDD 00), ACM Press, 2000, pp. 99-107.
[17] M.Z. Ashrafi, D. Taniar, and K. Smith, "ODAM: An Optimized Distributed Association Rule Mining
Algorithm," IEEE Distributed Systems Online, vol. 5, no. 3, Mar. 2004.
[18] T. Imielinski and A. Virmani, "MSQL: A Query Language for Database Mining," 1999.
[19] R. Meo, G. Psaila, and S. Ceri, "A New SQL-like Operator for Mining Association Rules," The VLDB
Journal, 1996, pp. 122-133.
[20] A. Termier, M.-C. Rousset, and M. Sebag, "Mining XML Data with Frequent Trees," Proc. DBFusion
Workshop '02, 2002, pp. 87-96.
[21] A. Prodromidis, P. Chan, and S. Stolfo, "Meta-Learning in Distributed Data Mining Systems: Issues and
Approaches," book chapter, AAAI/MIT Press, 2000.
[22] H. Kargupta, I. Hamzaoglu, and B. Stafford, "Scalable, Distributed Data Mining: An Agent Architecture,"
in Heckerman et al. (eds.), p. 211.
[23] A.Y. Zomaya, T. El-Ghazawi, and O. Frieder, "Parallel and Distributed Computing for Data Mining," IEEE
Concurrency, 1999.
[24] Q. Ding, K. Ricords, and J. Lumpkin, "Deriving General Association Rules from XML Data," Dept. of
Computer Science, Pennsylvania State University at Harrisburg, Middletown, PA 17057, USA.
Sujni Paul obtained her Bachelor's degree in Physics from Manonmaniam Sundaranar University in 1997 and her
Master's degree in Computer Applications from Bharathiar University in 2000. She completed her PhD in Data
Mining in the Department of Computer Applications, Karunya University, Coimbatore, India. She works in the
area of parallel and distributed data mining and has been serving as an Associate Professor at Karunya University
for the past eight years.