An Optimized Distributed Association Rule Mining Algorithm in Parallel and Distributed Data Mining With XML Data For Improved Response Time
Dr (Mrs).Sujni Paul
Associate Professor Department of Computer Applications, Karunya University, Coimbatore 641114 , Tamil Nadu, India
[email protected]
ABSTRACT
Many current data mining tasks can be accomplished successfully only in a distributed setting. The field of distributed data mining has therefore gained increasing importance in the last decade. The Apriori algorithm by Rakesh Agrawal has emerged as one of the best association rule mining algorithms, and it also serves as the base algorithm for most parallel algorithms. The enormity and high dimensionality of the datasets typically available as input to association rule discovery make it an ideal problem for solving on multiple processors in parallel, the primary reasons being the memory and CPU speed limitations faced by single processors. In this paper an Optimized Distributed Association Rule Mining algorithm for geographically distributed data is used in a parallel and distributed environment so that communication costs are reduced. The response time is calculated in this environment using XML data.
KEYWORDS
Association rules, Apriori algorithm, parallel and distributed data mining, XML data, response time.
1. INTRODUCTION
Association rule mining (ARM) has become one of the core data mining tasks and has attracted tremendous interest among data mining researchers. ARM is an undirected or unsupervised data mining technique which works on variable-length data and produces clear and understandable results. Two dominant approaches for utilizing multiple processors have emerged: distributed memory, in which each processor has a private memory, and shared memory, in which all processors access common memory [4]. Shared memory architecture has many desirable properties: each processor has direct and equal access to all memory in the system, and parallel programs are easy to implement on such a system. In distributed memory architecture each processor has its own local memory that can only be accessed directly by that processor [10]. For a processor to have access to data in the local memory of another processor, a copy of the desired data element must be sent from one processor to the other through message passing. XML data are used with the Optimized Distributed Association Rule Mining algorithm. A parallel application can be divided into a number of tasks and executed concurrently on different processors in the system [9]. However, the performance of a parallel application on a distributed system is mainly dependent on the allocation of the tasks comprising the application onto the available processors in the system.
10.5121/ijcsit.2010.2208
International Journal of Computer Science and Information Technology, Volume 2, Number 2, April 2010
Modern organizations are geographically distributed. Typically, each site locally stores its ever-increasing amount of day-to-day data. Using centralized data mining to discover useful patterns in such organizations' data isn't always feasible, because merging data sets from different sites into a centralized site incurs huge network communication costs. Data from these organizations are not only distributed over various locations but also vertically fragmented, making it difficult, if not impossible, to combine them in a central location. Distributed data mining has thus emerged as an active subarea of data mining research. In this paper an Optimized Distributed Association Rule Mining algorithm is used for performing the mining process.
2. RELATED WORK
Three parallel algorithms for mining association rules, an important data mining problem, are formulated in [2]. These algorithms were designed to investigate and understand the performance implications of a spectrum of trade-offs between computation, communication, memory usage, synchronization, and the use of problem-specific information in parallel data mining [11]. Fast Distributed Mining of association rules generates a small number of candidate sets and substantially reduces the number of messages to be passed when mining association rules [3]. Algorithms for mining association rules from relational data have been well developed, and several query languages have been proposed to assist association rule mining, such as [18], [19]. The topic of mining XML data has received little attention, as the data mining community has focused on the development of techniques for extracting common structure from heterogeneous XML data. For instance, [20] proposes an algorithm to construct a frequent tree by finding common subtrees embedded in heterogeneous XML data. On the other hand, some researchers focus on developing a standard model to represent the knowledge extracted from the data using XML. JAM [21] has been developed to gather information from sparse data sources and induce a global classification model. The PADMA system [22] is a document analysis tool working in a distributed environment, based on cooperative agents. It works without any relational database underneath; instead, PADMA agents perform several relational operations with the information extracted from the documents.
Find all combinations of items that have transaction support above the minimum support; call those combinations frequent itemsets. Then use the frequent itemsets to generate the desired rules. The general idea is that if, say, ABCD and AB are frequent itemsets, then we can determine whether the rule AB => CD holds by computing the ratio r = support(ABCD)/support(AB). The rule holds only if r >= minimum confidence. Note that the
rule will have minimum support because ABCD is frequent. The algorithm is highly scalable [7]. The Apriori algorithm used in Quest for finding all frequent itemsets is given below.
procedure AprioriAlg()
begin
   L1 := {frequent 1-itemsets};
   for ( k := 2; Lk-1 ≠ ∅; k++ ) do {
      Ck := apriori-gen(Lk-1);   // new candidates
      for all transactions t in the dataset do {
         for all candidates c ∈ Ck contained in t do
            c.count++;
      }
      Lk := { c ∈ Ck | c.count >= min-support };
   }
   Answer := ∪k Lk;
end
It makes multiple passes over the database. In the first pass, the algorithm simply counts item occurrences to determine the frequent 1-itemsets (itemsets with 1 item). A subsequent pass, say pass k, consists of two phases. First, the frequent itemsets Lk-1 (the set of all frequent (k-1)-itemsets) found in the (k-1)th pass are used to generate the candidate itemsets Ck, using the apriori-gen() function. This function first joins Lk-1 with Lk-1, the joining condition being that the lexicographically ordered first k-2 items are the same. Next, it deletes all those itemsets from the join result that have some (k-1)-subset that is not in Lk-1 yielding Ck. The algorithm now scans the database. For each transaction, it determines which of the candidates in Ck are contained in the transaction using a hash-tree data structure and increments the count of those candidates [8], [14]. At the end of the pass, Ck is examined to determine which of the candidates are frequent, yielding Lk. The algorithm terminates when Lk becomes empty.
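The pass structure described above can be sketched in Python. Itemsets are represented as frozensets, and a plain dictionary stands in for the hash tree used by the original algorithm, so this is an illustrative simplification rather than the Quest implementation.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {frequent itemset: support count} for all frequent itemsets."""
    # Pass 1: count 1-itemsets and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in set(t):
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(L)
    k = 2
    while L:
        # apriori-gen: join L(k-1) with itself, then prune any candidate
        # that has an infrequent (k-1)-subset.
        prev = list(L)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                        frozenset(sub) in L
                        for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Count each candidate contained in each transaction (a plain dict
        # replaces the hash tree of the original algorithm).
        counts = {c: 0 for c in candidates}
        for t in transactions:
            items = set(t)
            for c in candidates:
                if c <= items:
                    counts[c] += 1
        L = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(L)
        k += 1
    return frequent
```

With the Figure 1a data and min-support 5 (50 percent) this yields the frequent itemsets A, B, C, D, AB, AD, BC, and CD; the confidence of a rule such as A => B is then support(AB)/support(A) = 5/7, looked up directly from the returned dictionary.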
2. After the candidate sets have been generated, two pruning techniques, local pruning and global pruning, are developed to prune away some candidate sets at each individual site.

3. In order to determine whether a candidate set is large, this algorithm requires only O(n) messages for support-count exchange, where n is the number of sites in the network. This is much less than a straight adaptation of Apriori, which requires O(n^2) messages [3].

Distributed data mining refers to the mining of distributed data sets. The data sets are stored in local databases hosted by local computers which are connected through a computer network [15], [16]. Data mining takes place at a local level and at a global level, where local data mining results are combined to gain global findings. Distributed data mining is often mentioned together with parallel data mining in the literature. While both attempt to improve the performance of traditional data mining systems, they assume different system architectures and take different approaches. In distributed data mining, computers are distributed and communicate through message passing. In parallel data mining, a parallel computer is assumed, with processors sharing memory and/or disk. Computers in a distributed data mining system may be viewed as processors sharing nothing. This difference in architecture greatly influences algorithm design, cost model, and performance measure in distributed and parallel data mining.
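The O(n)-message count exchange mentioned above can be illustrated with a small sketch in which each site ships its local support counts once to a single polling site, which sums them and keeps only the globally frequent candidates. The function and variable names here are hypothetical, not from the cited algorithm.

```python
def exchange_support_counts(local_counts_per_site, min_global_support):
    """One round of support-count exchange.

    Each of the n sites sends its local support counts to a single
    polling site (n messages in total), which sums them and returns
    the candidates meeting the global support threshold.
    """
    global_counts = {}
    for site_counts in local_counts_per_site:   # one message per site
        for itemset, count in site_counts.items():
            global_counts[itemset] = global_counts.get(itemset, 0) + count
    return {s: c for s, c in global_counts.items()
            if c >= min_global_support}
```

A naive adaptation of Apriori would instead have every site broadcast its counts to every other site, which is where the O(n^2) message cost comes from.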
The speed and efficiency of parallel formulations are discussed below. In parallel data mining the main issues taken into account are speedup and efficiency.

Speedup. Let fs denote the serial fraction of the computation and fp = 1 - fs the parallelizable fraction. With p processors, the speedup is

    S = Ts / Tp = 1 / (fs + fp/p)

where Ts is the serial run time and Tp is the parallel run time.
This places a ceiling on the maximum speedup, so there are limits to the scalability of parallelism. For example, at fp = 0.50, 0.90, and 0.99, respectively 50%, 90%, and 99% of the code is parallelizable. However, certain problems demonstrate increased performance with increasing problem size; problems which increase the percentage of parallel time with their size are more "scalable" than problems with a fixed percentage of parallel time.
Efficiency. Given the parallel cost p * Tp = Ts + To, where To is the total overhead, the efficiency E is

    E = Ts / (p * Tp) = 1 / (1 + To/Ts)
In general, the total overhead To is an increasing function of p, growing at least linearly when fs > 0, due to communication, extra computation, idle periods caused by sequential components, and idle periods caused by load imbalance.
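The speedup and efficiency formulas above can be checked numerically with a small sketch:

```python
def speedup(fs, p):
    """Amdahl's-law speedup S = 1 / (fs + fp/p), with fp = 1 - fs."""
    fp = 1.0 - fs
    return 1.0 / (fs + fp / p)

def efficiency(fs, p):
    """Efficiency E = S / p, equivalently Ts / (p * Tp)."""
    return speedup(fs, p) / p
```

For instance, with fs = 0.10 (90% of the code parallelizable) and p = 8 processors, the speedup is 1 / (0.1 + 0.9/8), roughly 4.7, well short of the ideal 8, which illustrates the ceiling the serial fraction imposes.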
4. OPTIMIZED DISTRIBUTED ASSOCIATION RULE MINING ALGORITHM
The performance of the Apriori ARM algorithm degrades for various reasons. It requires n database scans to generate a frequent n-itemset. Furthermore, it doesn't recognize transactions in the data set with identical itemsets if that data set is not loaded into main memory. Therefore, it unnecessarily occupies resources by repeatedly generating itemsets from such identical transactions. For example, if a data set has 10 identical transactions, the Apriori algorithm not only enumerates the same candidate itemsets 10 times but also updates the support counts for those candidate itemsets 10 times in each iteration. Moreover, directly loading a raw data set into main memory won't reveal a significant number of identical transactions, because each transaction of a raw data set contains both frequent and infrequent items. To overcome these problems, we don't generate candidate support counts from the raw data set after the first pass. This technique not only reduces the average transaction length but also reduces the
data set size significantly, so we can accumulate more transactions in main memory. The number of items in the data set might be large, but only a few will satisfy the support threshold. Consider the sample data set in Figure 1a. If we load the data set into main memory, we find only one identical transaction (ABCD), as Figure 1b shows. However, if we load the data set into main memory after eliminating infrequent items from every transaction (that is, items that don't have 50 percent support and thus don't occur in at least every second transaction; in this case, itemset E), we find more identical transactions (see Figure 1c). This technique not only reduces average transaction size but also finds more identical transactions. The following gives the pseudocode of the ODAM algorithm [17].
NF = {non-frequent global 1-itemsets};
for all transactions t ∈ D {
   for all 2-subsets s of t
      if (s ∈ C2) s.sup++;
   t' = delete_nonfrequent_items(t);
   Table.add(t');
}
send_to_receiver(C2);
/* global frequent support counts from receiver */
F2 = receive_from_receiver(F);
C3 = {candidate itemsets};
T = Table.getTransactions();
k = 3;
while (Ck ≠ {}) {
   for all transactions t ∈ T
      for all k-subsets s of t
         if (s ∈ Ck) s.sup++;
   k++;
   send_to_receiver(Ck);
   /* generate candidate itemsets of pass k+1 */
   Ck+1 = {candidate itemsets};
}
Figure 1. (a) The sample data set; (b) the data set after loading into main memory, with identical transactions merged.

    (a)                    (b)
    No.   Items            No.    Items
    1     ABCD             1,10   ABCD
    2     BC               2      BC
    3     AB               3      AB
    4     ABCDE            4      ABCDE
    5     ACD              5      ACD
    6     ABE              6      ABE
    7     CDE              7      CDE
    8     AD               8      AD
    9     BCE              9      BCE
    10    ABCD
ODAM eliminates all globally infrequent 1-itemsets from every transaction and inserts the reduced transactions into main memory; this decreases the transaction size (the number of items) and finds more identical transactions, since the data set initially contains both frequent and infrequent items. However, the total number of transactions could still exceed the main memory limit, so ODAM processes the data set partition by partition. It removes infrequent items and inserts each transaction into main memory. While inserting a transaction, it checks whether that transaction is already in memory. If so, it increases that transaction's counter by one; otherwise, it inserts the transaction into main memory with a count of one. Finally, it writes all main-memory entries for this partition into a temp file. This process continues for all other partitions.
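The reduction step just described can be sketched as follows. The function and variable names are illustrative, not taken from the ODAM paper, and a Counter plays the role of the in-memory transaction table.

```python
from collections import Counter

def reduce_partition(transactions, frequent_items):
    """ODAM-style reduction for one partition.

    Strips globally infrequent 1-items from each transaction, then
    merges identical transactions, keeping a count instead of storing
    duplicates.
    """
    table = Counter()
    for t in transactions:
        trimmed = frozenset(t) & frequent_items
        if trimmed:
            table[trimmed] += 1   # existing entry: bump counter; new: insert with 1
    return table
```

Applied to the Figure 1a data with {A, B, C, D} as the globally frequent items, transaction 4 (ABCDE) loses E and becomes identical to transactions 1 and 10, so the table stores ABCD once with a count of 3 instead of three separate entries.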
<transactions>
  <transaction id="1">
    <items>
      <item>i1</item>
      <item>i4</item>
      <item>i7</item>
    </items>
  </transaction>
  <transaction id="2">
    <items>
      <item>i2</item>
      <item>i3</item>
      <item>i5</item>
    </items>
  </transaction>
  <transaction id="3">
    <items>
      <item>i1</item>
      <item>i3</item>
      <item>i7</item>
    </items>
  </transaction>
  <transaction id="4">
    <items>
      <item>i2</item>
      <item>i5</item>
    </items>
  </transaction>
  <transaction id="5">
    <items>
      <item>i1</item>
      <item>i5</item>
    </items>
  </transaction>
</transactions>
Input: min-conf, min-sup, XML file(s)
Output: association rules

   error-check input
   D <- new DOM from XML files
   assign record IDs to non-leaf nodes in D, by record type
   L <- new linked list
   construct set S of all unique record types, with elements as sets of size 1
   L[1] <- S
   i <- 1
   do until the BasketSetCollection generated from the merge operation is empty
      prune elements in L[i-1] not meeting min-sup
      L[i] <- collection of all unique record types, with elements as sets of
              size i, constructed from L[i-1]
      increment i
   generate all possible rule combinations in L
   O <- empty output set
   for all rules R in L, if R meets min-conf, add R to O
   output O

The basic idea is as follows. First, record IDs are assigned per record type: a basketSet is constructed for each type of record encountered. To start, an empty recordTypeList and an empty RIDList are taken. These lists are parallel, in that the recordType at position n of the recordTypeList is associated with the RID at position n of the RIDList. A single path from root to leaf is considered. As the algorithm progresses along this path, it examines each node. If the node is not a leaf, it looks at the node type (recordType) and asks the basketSet associated with this recordType for a new RID; it then adds the recordType to the end of the recordTypeList and the RID to the end of the RIDList. If the node is a leaf (consider a leaf to be of the form <purchase>pen</purchase>), it loops through the RIDList and recordTypeList to build baskets.
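As an illustration of reading such transaction documents, the listing above can be parsed into item baskets with Python's standard ElementTree module. This sketch assumes well-formed XML with quoted id attributes; it is not the paper's implementation.

```python
import xml.etree.ElementTree as ET

def baskets_from_xml(xml_text):
    """Parse a <transactions> document into {transaction id: set of items}."""
    root = ET.fromstring(xml_text)
    baskets = {}
    for txn in root.findall("transaction"):
        tid = txn.get("id")
        # Collect the text of every <item> under this transaction.
        items = {item.text.strip() for item in txn.iter("item")}
        baskets[tid] = items
    return baskets
```

The resulting dictionary maps each transaction id to its item set, which is exactly the basket representation the mining phase needs as input.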
7. PERFORMANCE EVALUATION
To evaluate the number of messages that ODAM exchanges among various sites to generate the globally frequent itemsets in a distributed environment, we partition the original data set into five partitions. To reduce the dependency among different partitions, each one contains only 20 percent of the original data set's transactions, so the number of identical transactions among different partitions is very low. ODAM
provides an efficient method for generating association rules from different datasets, distributed among various sites.
Figure 2: Total message size that ODAM transmits to generate the globally frequent itemsets.

The datasets are generated randomly, depending on the number of distinct items, the maximum number of items in each transaction, and the number of transactions. The performance of the XQuery implementation is dependent on the number of large itemsets found and the size of the dataset.
The running time for dataset-1 with minimum support 20% is much higher than the running times of dataset-2 and dataset-3, since the number of large itemsets found for dataset-1 is about 2 times more than for the other datasets. The response time of the parallel and distributed data mining task on XML data is determined by the time taken for communication and the computation cost involved. Communication time is largely dependent on the
DDM operational model and the architecture of the DDM system. The computation time is the time to perform the mining process on the distributed data sets.
A perfect scale-up is found when the different types of datasets are considered, and an improved response time is achieved for the XML data used.
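The decomposition of response time into communication and computation can be expressed as a simple model: since sites work in parallel, the task completes when the slowest site has finished computing and communicating. This is a simplified illustrative model, not the paper's measurement procedure, and all names and numbers are hypothetical.

```python
def response_time(site_costs):
    """Response time of a distributed mining task.

    site_costs: list of (computation_time, communication_time) pairs,
    one per site.  Sites run in parallel, so the overall response time
    is the completion time of the slowest site.
    """
    return max(comp + comm for comp, comm in site_costs)
```

For example, with three sites costing (4.0, 1.0), (3.0, 0.5), and (5.0, 2.0) time units for computation and communication respectively, the response time is governed by the third site at 7.0 units, which is why reducing communication cost directly improves response time.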
8. CONCLUSIONS
Association rule mining is an important problem of data mining. It is a new and challenging area to perform association rule mining on XML data due to the complexity of XML data. In our approach, multiple nesting problems in XML data are handled appropriately to assure the correctness of the result. The Optimized Distributed Association Rule Mining algorithm is used for the mining process in a parallel and distributed environment. The response time, with the communication and computation factors, is considered to achieve an improved response time. The performance analysis is done by increasing the number of processors in a distributed environment. As the mining process is done in parallel, an optimal solution is obtained. A future enhancement of this work is to cluster the same XML dataset and find out the knowledge extracted from it. A visual analysis can also be made for the same.
REFERENCES
[1] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Databases (VLDB 94), Morgan Kaufmann, 1994, pp. 407-419.
[2] R. Agrawal and J.C. Shafer, "Parallel Mining of Association Rules," IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, 1996, pp. 962-969.
[3] D.W. Cheung et al., "A Fast Distributed Algorithm for Mining Association Rules," Proc. Parallel and Distributed Information Systems, IEEE CS Press, 1996, pp. 31-42.
[4] A. Savasere, E. Omiecinski, and S.B. Navathe, "An Efficient Algorithm for Mining Association Rules in Large Databases," Proc. 21st Int'l Conf. Very Large Databases (VLDB 95), Morgan Kaufmann, 1995, pp. 432-444.
[5] J. Han, J. Pei, and Y. Yin, "Mining Frequent Patterns without Candidate Generation," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2000, pp. 1-12.
[6] M.J. Zaki and Y. Pan, "Introduction: Recent Developments in Parallel and Distributed Data Mining," J. Distributed and Parallel Databases, vol. 11, no. 2, 2002, pp. 123-127.
[7] M.J. Zaki, "Scalable Algorithms for Association Mining," IEEE Trans. Knowledge and Data Eng., vol. 12, no. 2, 2000, pp. 372-390.
[8] J.S. Park, M. Chen, and P.S. Yu, "An Effective Hash Based Algorithm for Mining Association Rules," Proc. 1995 ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 1995, pp. 175-186.
[9] M.J. Zaki et al., "Parallel Data Mining for Association Rules on Shared-Memory Multiprocessors," tech. report TR 618, Computer Science Dept., Univ. of Rochester, 1996.
[10] D.W. Cheung et al., "Efficient Mining of Association Rules in Distributed Databases," IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, 1996, pp. 911-922.
[11] A. Schuster and R. Wolff, "Communication-Efficient Distributed Mining of Association Rules," Proc. ACM SIGMOD Int'l Conf. Management of Data, ACM Press, 2001, pp. 473-484.
[12] M.J. Zaki, "Parallel and Distributed Association Mining: A Survey," IEEE Concurrency, Oct.-Dec. 1999, pp. 14-25.
[13] C.L. Blake and C.J. Merz, UCI Repository of Machine Learning Databases, Dept. of Information and Computer Science, Univ. of California, Irvine, 1998.
[14] T. Shintani and M. Kitsuregawa, "Hash-Based Parallel Algorithms for Mining Association Rules," Proc. Conf. Parallel and Distributed Information Systems, IEEE CS Press, 1996, pp. 19-30.
[15] C.C. Aggarwal and P.S. Yu, "A New Approach to Online Generation of Association Rules," IEEE Trans. Knowledge and Data Eng., vol. 13, no. 4, 2001, pp. 527-540.
[16] G.I. Webb, "Efficient Search for Association Rules," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD 00), ACM Press, 2000, pp. 99-107.
[17] M.Z. Ashrafi, D. Taniar, and K. Smith, "ODAM: An Optimized Distributed Association Rule Mining Algorithm," IEEE Distributed Systems Online, vol. 5, no. 3, Mar. 2004.
[18] T. Imielinski and A. Virmani, "MSQL: A Query Language for Database Mining," 1999.
[19] R. Meo, G. Psaila, and S. Ceri, "A New SQL-like Operator for Mining Association Rules," Proc. VLDB, 1996, pp. 122-133.
[20] A. Termier, M.-C. Rousset, and M. Sebag, "Mining XML Data with Frequent Trees," DBFusion Workshop 02, pp. 87-96.
[21] A. Prodromidis, P. Chan, and S. Stolfo, "Meta-Learning in Distributed Data Mining Systems: Issues and Approaches," AAAI/MIT Press, 2000.
[22] H. Kargupta, I. Hamzaoglu, and B. Stafford, "Scalable, Distributed Data Mining: An Agent Architecture," in Heckerman et al. (eds.), p. 211.
[23] A.Y. Zomaya, T. El-Ghazawi, and O. Frieder, "Parallel and Distributed Computing for Data Mining," IEEE Concurrency, 1999.
[24] Q. Ding, K. Ricords, and J. Lumpkin, "Deriving General Association Rules from XML Data," Dept. of Computer Science, Pennsylvania State Univ. at Harrisburg, Middletown, PA, USA.
Sujni Paul obtained her bachelor's degree in Physics from Manonmaniam Sundaranar University in 1997 and her master's degree in Computer Applications from Bharathiar University in 2000. She completed her PhD in data mining in the Department of Computer Applications, Karunya University, Coimbatore, India. She works in the area of parallel and distributed data mining and has been an Associate Professor at Karunya University for the past eight years.