Mining Closed Regular Patterns in Data Streams
Department of Computer Science and Engineering, K L University, Guntur, Andhra Pradesh, India
[email protected]
Department of Computer Science and Engineering, LBR College of Engineering, Mylavaram, Andhra Pradesh, India
[email protected]
ABSTRACT
Mining regular patterns in data streams is an emerging and challenging research area, because in a data stream new data arrives continuously and at varying rates. Closed itemset mining has gained considerable importance in data mining research compared with conventional mining methods. In this paper we propose a novel approach called CRPDS (Closed Regular Patterns in Data Streams), which uses a vertical data format and the sliding window model. To our knowledge, no method has previously been proposed to mine closed regular patterns in data streams. As the stream flows, CRPDS mines closed regular itemsets based on a regularity threshold and a user-given support count. The experimental results show that the proposed method is efficient and scalable in terms of memory and time.
KEYWORDS
Data streams, regular patterns, closed regular patterns, transaction sliding window.
1. INTRODUCTION
Mining data streams efficiently is a challenging area of data mining research, because new data arrives and old data becomes obsolete at high speed. Data streams have emerged in recent years as a class of applications in which data is continuous, unbounded, and high speed, and in which the data distribution changes over time [1]. Data streams can be classified into two types: (a) offline data streams and (b) online data streams. Web log reports and queries on a data warehouse are examples of offline data streams. Network transactions, bank transactions, and sensor data are examples of online data streams. Mining a data stream requires fast, real-time processing to keep up with the high-speed arrival of data, and results must be produced within a short response time. Moreover, multiple scans over a data stream are not practical.
Recently, Tanbeer et al. [8] introduced the problem of discovering regular patterns, that is, patterns that follow a temporal regularity in their occurrence behaviour. A pattern derived from a database on the basis of a user-given regularity threshold is called a regular pattern. The regularity of an item plays an important role in the mining process. For example, in a retail market it can be more important to know how regularly a product is sold than how many units are sold. They also introduced the same problem for data streams. At present, closed itemset mining has gained more significance than traditional frequent itemset mining methods. The literature shows that many methods have been proposed to mine closed itemsets in data mining research. Gupta et al. proposed CLICI, a method to mine closed itemsets on data streams using formal concept analysis in the landmark window model [9]. Pramod S. and Vyas O. P. proposed an algorithm to mine frequent itemsets from online data streams using a prefix-tree-based structure [10]. In this method the items of every transaction must be arranged in sorted order, which costs additional time and memory.
In this paper we therefore propose a new method called CRPDS (Closed Regular Patterns in Data Streams) to mine closed regular patterns on data streams using the vertical data format in the sliding window model. The main idea is to develop a simple yet powerful method that captures data from the stream into a window, using the sliding window mechanism, and finds closed regular itemsets. The sliding window model holds a fixed number of transactions in the window, which keeps the transaction processing time constant. However, this model offers only limited monitoring of the continuous changes in a data stream. The rest of the paper is organized as follows. Section 2 describes the related work; Section 3 gives the problem definition of closed regular pattern mining. The CRPDS method for mining closed regular patterns on data streams using the vertical data format with the sliding window model is discussed in Section 4. Section 5 describes the experimental results, and Section 6 concludes the paper.
DOI : 10.5121/ijcsit.2013.5114
International Journal of Computer Science & Information Technology (IJCSIT) Vol 5, No 1, February 2013
2. RELATED WORK
Discovering interesting patterns in data streams has been a demanding problem in recent years. Guohui Li et al. proposed a method for mining frequent patterns in an arbitrary sliding window of a data stream, using a time-decaying model to distinguish patterns in recent transactions from those in old transactions [2]. In this process, as the stream flows, the SWP tree captures the contents of the stream data and infrequent patterns are deleted. Kun Li et al. proposed the OPFI Stream algorithm, which uses a prefix tree data structure to mine frequent itemsets over data streams with the sliding window technique [3]. T. Calders et al. proposed an optimized incremental algorithm for mining frequent itemsets in data streams [11]. Leung et al. proposed the DStree structure for mining frequent sets in a data stream using a sliding window [4]. In this process the transactions are arranged in a canonical order specified by the user before the tree is constructed, and the mining process then runs on the tree. Giannella et al. developed the FP-stream approach to mine frequent patterns in data streams at multiple time granularities [5]. The CPS-tree (compact pattern stream tree) captures recent stream content and mines the complete set of frequent patterns from a high-speed data stream over the sliding window model, discarding obsolete transactional data [12][13].
Occurrence frequency alone may not be sufficient: the temporal regularity of an item's occurrence behaviour is also required in many data stream applications, such as stock market analysis and network monitoring. Traditional frequent pattern mining methods fail to capture the occurrence behaviour of itemsets because they focus mainly on the frequency of occurrence. The literature shows that mining regular patterns in static databases has been addressed. Tanbeer et al. proposed the RPS-tree for discovering regular patterns over data streams [6]; it uses the sliding window mechanism and a tree-based structure to capture regular itemsets. VFDT is an algorithm for mining decision trees from continuously changing data streams, and CVFDT, proposed by Geoff Hulten et al. [14], reapplies VFDT on a moving window. Vijay Kumar et al. proposed the VDSRP method to generate the complete set of regular patterns over a data stream at a user-given regularity threshold using the vertical data format [7].
3. PROBLEM DEFINITION
Let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of items. A set $X = \{i_l, \ldots, i_q\} \subseteq I$, where $l \le q$ and $l, q \in [1, n]$, is called an itemset or a pattern. A transaction $t = (tid, Y)$ is a tuple where $tid$ is a transaction ID and $Y$ is a pattern. The set of transactions $T = \{t_1, t_2, \ldots, t_m\}$ is a transactional database $DB$ over $I$.
Table 1 contains a series of transactions of the data stream DS. Each transaction in the stream consists of a transaction ID (tid) and the itemset that belongs to it. Consider a transaction sliding window TSW of size 8, i.e., |W| = 8, covering tid 1 to tid 8. Convert the transactions of TSW1 into vertical format (i.e., itemset and list of tids), find the periodicities PX of each item, and take the maximum periodicity as the regularity of the item. Assume the maximum regularity threshold is 4; items whose regularity is less than or equal to this threshold are regular items. TSW1 in vertical data format is shown in Table 2.
[Table 1 illustration: stream flow with transaction sliding windows TSW1 and TSW2; table contents not recovered]
Phase 1
Input: TSW in DS
Output: Set of regular items
Procedure
1. Consider TSW of DS, |W| = 8
2. Convert TSW into vertical data format.
3. Let Xi ∈ TSW, where Xi is a k-itemset.
4. PXi = 0 for all Xi
5. For every Xi calculate periodicity
6. PXi = PXi+1 - PXi (difference between consecutive tids of Xi)
7. reg(Xi) = max(PXi)
8. If reg(Xi) <= maximum regularity threshold
9. Xi is regular item set
10. Else
11. Delete Xi
12. Repeat step 3 to step 11 for i+p itemsets (p = 1, 2, 3, ...)
Table 2. TSW1 in vertical data format
| Itemset | tid |
|---|---|
| a | 1, 5, 7 |
| b | 2, 3, 5 |
| c | 1, 2, 3, 4, 5, 6, 7, 8 |
| d | 4, 6, 7, 8 |
| e | 1, 4, 5, 6, 7, 8 |
| f | 1, 2, 3, 8 |
In Phase 1, regular itemsets are mined in TSW1 of the data stream DS. First we convert the horizontal transactions of TSW1, of size |W| = 8, into vertical data format. Xi is an itemset and PXi is the periodicity of itemset Xi. We calculate the periodicities PX for each itemset and take the maximum periodicity as the regularity of the itemset.
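The conversion from horizontal to vertical format described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `to_vertical` is hypothetical, and because the transactions of Table 1 are not reproduced here, the eight transactions of TSW1 are reconstructed from the tid lists of Table 2.

```python
from collections import defaultdict

# Assumption: these eight transactions are reconstructed from the tid lists
# in Table 2, since the original Table 1 is not available in this text.
TSW1 = [
    (1, {"a", "c", "e", "f"}),
    (2, {"b", "c", "f"}),
    (3, {"b", "c", "f"}),
    (4, {"c", "d", "e"}),
    (5, {"a", "b", "c", "e"}),
    (6, {"c", "d", "e"}),
    (7, {"a", "c", "d", "e"}),
    (8, {"c", "d", "e", "f"}),
]


def to_vertical(window):
    """Map each item to the sorted list of tids in which it occurs."""
    tid_lists = defaultdict(list)
    for tid, items in window:
        for item in items:
            tid_lists[item].append(tid)
    # Sort items and tids so the result is independent of set iteration order.
    return {item: sorted(tids) for item, tids in sorted(tid_lists.items())}


vertical = to_vertical(TSW1)
for item, tids in vertical.items():
    print(item, tids)
```

Each item's tid list is all that the later steps need, so the window never has to be rescanned in horizontal form.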
Table 3. TSW1 with itemset, PX and Reg
| Itemset | tid | PX | Reg |
|---|---|---|---|
| a | 1, 5, 7 | 1, 4, 2, 1 | 4 |
| b | 2, 3, 5 | 2, 1, 2, 3 | 3 |
| c | 1, 2, 3, 4, 5, 6, 7, 8 | 1, 1, 1, 1, 1, 1, 1, 1 | 1 |
| d | 4, 6, 7, 8 | 4, 2, 1, 1 | 4 |
| e | 1, 4, 5, 6, 7, 8 | 1, 3, 1, 1, 1, 1 | 3 |
| f | 1, 2, 3, 8 | 1, 1, 1, 5 | 5 |
The itemsets of TSW1 with their periodicities and regularities are presented in Table 3. For example, itemset (d) appears in transactions (4, 6, 7, 8), so its periodicity values are PX(d) = {4, 2, 1, 1} and its regularity is max(4, 2, 1, 1) = 4. In our running example the maximum regularity threshold is 4. Itemsets whose regularity is less than or equal to this threshold are regular itemsets. Therefore, as Table 3 shows, items {a, b, c, d, e} are regular itemsets and itemset {f} is not. In Table 4 the itemsets {(a c), (a e), (b c), (c d), (c e), (d e)} have a regularity less than or equal to the threshold and are regular itemsets; the rest are not. Similarly, 3-itemsets, 4-itemsets, and so on can be mined from the previous regular itemsets. In the second phase, closed regular itemsets are mined from the previously mined regular itemsets, which are shown in Table 5 and Table 6.
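The periodicity and regularity values of Table 3 can be reproduced with the short Python sketch below. It is an illustration under stated assumptions, not the authors' code: the names `periods`, `regularity`, `TID_LISTS`, and `MAX_REG` are hypothetical, and the window is assumed to start at tid 0 and end at |W| = 8, which is what the PX values for items a and b in Table 3 imply. Note that for items whose last tid equals |W| the sketch also returns a trailing period of 0, which Table 3 does not list.

```python
WINDOW_SIZE = 8   # |W| in the running example
MAX_REG = 4       # maximum regularity threshold in the running example

# Tid lists taken from Table 2.
TID_LISTS = {
    "a": [1, 5, 7],
    "b": [2, 3, 5],
    "c": [1, 2, 3, 4, 5, 6, 7, 8],
    "d": [4, 6, 7, 8],
    "e": [1, 4, 5, 6, 7, 8],
    "f": [1, 2, 3, 8],
}


def periods(tids, window_size):
    """Gaps between consecutive occurrences, including the gap from the
    window start (tid 0) and the gap to the window end (tid window_size)."""
    boundaries = [0] + sorted(tids) + [window_size]
    return [later - earlier for earlier, later in zip(boundaries, boundaries[1:])]


def regularity(tids, window_size):
    """Regularity is the largest period of the itemset in the window."""
    return max(periods(tids, window_size))


regular_items = [
    item for item, tids in TID_LISTS.items()
    if regularity(tids, WINDOW_SIZE) <= MAX_REG
]

for item, tids in TID_LISTS.items():
    print(item, periods(tids, WINDOW_SIZE), regularity(tids, WINDOW_SIZE))
print("regular items:", regular_items)
```

With the threshold of 4, the sketch keeps items a to e and drops f, in agreement with the text.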
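Table 4. Two itemsets of TSW1 with tid, PX and Reg (itemset labels reconstructed from the tid lists in Table 2)
| Itemset | tid | PX | Reg |
|---|---|---|---|
| (a b) | 5 | 5, 3 | 5 |
| (a c) | 1, 5, 7 | 1, 4, 2, 1 | 4 |
| (a d) | 7 | 7, 1 | 7 |
| (a e) | 1, 5, 7 | 1, 4, 2, 1 | 4 |
| (b c) | 2, 3, 5 | 2, 1, 2, 3 | 3 |
| (b e) | 5 | 5, 3 | 5 |
| (c d) | 4, 6, 7, 8 | 4, 2, 1, 1 | 4 |
| (c e) | 1, 4, 5, 6, 7, 8 | 1, 3, 1, 1, 1, 1 | 3 |
| (d e) | 4, 6, 7, 8 | 4, 2, 1, 1 | 4 |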
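In the vertical format, the tid list of a two itemset is the intersection of the tid lists of its two items, so the regular two itemsets can be found without rescanning the window. The Python sketch below illustrates this for the running example. It is a hedged illustration rather than the authors' implementation: `regular_pairs` and the other names are hypothetical, and the window boundaries 0 and |W| = 8 are assumed as in Table 3.

```python
from itertools import combinations

WINDOW_SIZE = 8   # |W| in the running example
MAX_REG = 4       # maximum regularity threshold in the running example

# Tid lists of the regular one itemsets (Table 3 without item f).
REGULAR_ITEMS = {
    "a": [1, 5, 7],
    "b": [2, 3, 5],
    "c": [1, 2, 3, 4, 5, 6, 7, 8],
    "d": [4, 6, 7, 8],
    "e": [1, 4, 5, 6, 7, 8],
}


def regularity(tids, window_size):
    """Largest gap between consecutive tids, counted from 0 to window_size."""
    boundaries = [0] + sorted(tids) + [window_size]
    return max(later - earlier for earlier, later in zip(boundaries, boundaries[1:]))


def regular_pairs(tid_lists, window_size, max_reg):
    """Intersect tid lists pairwise and keep the pairs that are regular."""
    result = {}
    for x, y in combinations(sorted(tid_lists), 2):
        common = sorted(set(tid_lists[x]) & set(tid_lists[y]))
        # An empty intersection has regularity window_size and is dropped
        # whenever max_reg is smaller than the window size.
        if regularity(common, window_size) <= max_reg:
            result[(x, y)] = common
    return result


pairs = regular_pairs(REGULAR_ITEMS, WINDOW_SIZE, MAX_REG)
for itemset, tids in pairs.items():
    print(itemset, tids, regularity(tids, WINDOW_SIZE))
```

The same intersection step can be repeated on the surviving pairs to obtain three itemsets, and so on.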
Phase 2
Output: Complete set of closed regular itemsets.
1. Let Xi ⊆ I be a regular p-itemset
2. Let Xj ⊆ I be a regular (p+k)-itemset, where k varies from 1 to n
3. Xi ⊂ Xj for all i <= j
4. Find the support counts of Xi and Xj, i.e., Sup(Xi) and Sup(Xj)
5. For user-given minimum support S
6. If Sup(Xi) > Sup(Xj)
7. Xi is closed regular itemset
8. Else
9. Delete Xi
Table 5. Regular one itemsets with Support
| Itemset | tid | Reg | Sup |
|---|---|---|---|
| a | 1, 5, 7 | 4 | 3 |
| b | 2, 3, 5 | 3 | 3 |
| c | 1, 2, 3, 4, 5, 6, 7, 8 | 1 | 8 |
| d | 4, 6, 7, 8 | 4 | 4 |
| e | 1, 4, 5, 6, 7, 8 | 3 | 6 |
The one itemsets with their regularity and support count values are shown in Table 5. Assume S = 4. Itemsets (c), (d), and (e) satisfy the specified minimum support S, whereas itemsets (a) and (b) do not. Among the two itemsets, {(a c), (a e), (b c)} do not satisfy S and {(c d), (c e), (d e)} do. We apply the closure property to the itemsets that satisfy both regularity and support: the support count of itemset (c) is greater than the support counts of its supersets (c d) and (c e), so (c) is a closed regular itemset.
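Table 6. Regular two itemsets with Support
| Itemset | tid | Reg | Sup |
|---|---|---|---|
| (a c) | 1, 5, 7 | 4 | 3 |
| (a e) | 1, 5, 7 | 4 | 3 |
| (b c) | 2, 3, 5 | 3 | 3 |
| (c d) | 4, 6, 7, 8 | 4 | 4 |
| (c e) | 1, 4, 5, 6, 7, 8 | 3 | 6 |
| (d e) | 4, 6, 7, 8 | 4 | 4 |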
Itemset (a b c) (a b d) (a b e) (a c d) (a c e) (a d e) (b c d) (b c e) (c d e)
The itemset (c d e) satisfies both the regularity and the support thresholds. The support count of itemset (c e) is greater than the support count of itemset (c d e), so (c e) is a closed regular itemset, whereas itemsets (c d) and (d e) have the same support count as (c d e) and are therefore not closed regular itemsets. In the same way, three itemsets, four itemsets, and so on can be examined until no more closed regular itemsets are found.
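The closure check of the second phase can be illustrated with the Python sketch below, which reproduces the running example. It is a sketch under assumptions, not the CRPDS implementation: for brevity it enumerates all itemsets over the regular items by brute force instead of growing them level by level, the names `regular_frequent_itemsets` and `closed_itemsets` are hypothetical, and an itemset with no regular frequent superset (here (c d e)) is assumed to count as closed.

```python
from itertools import combinations

WINDOW_SIZE = 8   # |W| in the running example
MAX_REG = 4       # maximum regularity threshold
MIN_SUP = 4       # user-given minimum support S

# Tid lists of the regular one itemsets (Table 5).
ITEM_TIDS = {
    "a": {1, 5, 7},
    "b": {2, 3, 5},
    "c": {1, 2, 3, 4, 5, 6, 7, 8},
    "d": {4, 6, 7, 8},
    "e": {1, 4, 5, 6, 7, 8},
}


def regularity(tids):
    """Largest gap between consecutive tids, counted from 0 to WINDOW_SIZE."""
    boundaries = [0] + sorted(tids) + [WINDOW_SIZE]
    return max(later - earlier for earlier, later in zip(boundaries, boundaries[1:]))


def regular_frequent_itemsets():
    """Brute-force enumeration of itemsets that satisfy both thresholds."""
    found = {}
    items = sorted(ITEM_TIDS)
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            tids = set.intersection(*(ITEM_TIDS[i] for i in itemset))
            if regularity(tids) <= MAX_REG and len(tids) >= MIN_SUP:
                found[itemset] = len(tids)  # support count = number of tids
    return found


def closed_itemsets(supports):
    """Keep X only if every proper superset has a strictly smaller support."""
    closed = {}
    for x, sup_x in supports.items():
        absorbed = any(
            set(x) < set(y) and sup_y == sup_x for y, sup_y in supports.items()
        )
        if not absorbed:
            closed[x] = sup_x
    return closed


supports = regular_frequent_itemsets()
closed = closed_itemsets(supports)
print(closed)
```

Because a superset can never have a larger support than its subset, testing for an equal-support superset is equivalent to the condition Sup(Xi) > Sup(Xj) in the Phase 2 procedure.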
5. EXPERIMENTAL RESULTS
In this section we present our results for closed regular pattern mining in data streams. We implemented our algorithm in Java on a machine with a 2.66 GHz CPU and 2 GB of main memory running the Windows XP operating system. We applied our mining process to Kosarak (a real dataset) and T10I4D100K (a synthetic dataset). These datasets, which are frequently used in frequent pattern mining experiments, were developed at the IBM Almaden Quest research group and obtained from http://cvs.buu.ac.th/mining/datasets/synthesis_data and the UCI machine learning repository (University of California, Irvine, CA). The real dataset was provided by Ferenc Bodon and contains click-stream data of a Hungarian online news portal. We used the T10I4D100K and Kosarak datasets with different regularity and support values to compare our results with the RP-tree, which finds only regular itemsets. The T10I4D100K dataset contains 870 items and 100,759 transactions with an average transaction length of 10.10. The Kosarak dataset contains 41,270 items and 990,000 transactions with an average transaction length of 8.10. To produce our results we considered sizes of 100K and 500K for the T10I4D100K and Kosarak datasets; the results are shown in Figure 1 and Figure 2. The experimental results in both graphs show that the higher the max-reg and min-sup values, the longer the execution time.
Figure 1. Execution time over Kosarak (DB size = 500K; horizontal axis: max-reg (%) and min-sup (%))
6. CONCLUSION
Closed regular pattern mining in data streams is a completely new approach in data mining applications. We proposed the CRPDS algorithm to mine closed regular patterns using the vertical data format with the sliding window model. The advantage of the proposed algorithm is that it requires only simple operations, such as addition and subtraction, on simple structures such as arrays. Our experimental results have shown the effectiveness of CRPDS in terms of execution time.
ACKNOWLEDGEMENTS
We are very thankful to Sri G. Vijay Kumar, Associate Professor in the Department of Computer Science and Engineering, K L University, who supported and contributed to the fulfilment of our work.
REFERENCES
[1] Jiang, N., Gruenwald, L., (2006) Research issues in data stream association rule mining, SIGMOD Record 35(1), pp 14-19.
[2] Guohui Li, Hui Chen, Bing Yang and Gang Chen, (2008) Mining frequent patterns in an arbitrary sliding window over data streams, DASFAA, LNCS 4947, pp 496-503.
[3] Kun Li, Yongyan Wang, Manzoor Elahi, Xin Li and Hongan Wang, (2008 September) Mining recent frequent itemsets in data streams with optimistic pruning, Springer, ECML PKDD.
[4] Leung, C.K.S., Khan, Q.I., (2006 December) DStree: A tree structure for mining of frequent sets from data streams, ICDM, IEEE Press, Los Alamitos, pp 928-932.
[5] Giannella, C., Han, J., Pei, J., Yan, X., Yu, P.S., (2004) Mining frequent patterns in data streams at multiple time granularities, In Data Mining: Next generation challenges and future directions, AAAI/MIT Press, pp 191-212.
[6] S.K. Tanbeer, C.F. Ahmed, B.S. Jeong, and Y.K. Lee, (2010) Mining Regular Patterns in data streams, DASFAA, Springer, Part I, LNCS 5981, pp 399-413.
[7] Vijay Kumar, G., Sreedevi, M., Pavan Kumar, N.V.S., (2012) Mining Regular Patterns in data streams using Vertical format, IJCSS volume 6 Issue 2, pp 142-149.
[8] S.K. Tanbeer, C.F. Ahmed, B.S. Jeong, and Y.K. Lee, (2008) Mining Regular Patterns in Transactional Databases, IEICE Trans. on Information Systems, E91-D, 11, pp 2568-2577.
[9] Anamika Gupta, Vasudha Bhatnagar, and Naveen Kumar, (2010) Mining closed itemsets on data streams using formal concept analysis, pp 285-296.
[10] S. Pramod and O.P. Vyas, (2010) Frequent Itemset mining over transactional data streams using ItemOrder-Tree, IJCSE vol 2 no 8, pp 2598-2601.
[11] T. Calders, N. Dexters, J.J.M. Gillis and B. Goethals, (2012) Mining frequent itemsets in a stream, Information Systems, Elsevier.
[12] Syed Khairuzzaman Tanbeer, Chowdhury Farhan Ahmed, Byeong-Soo Jeong, Young-Koo Lee, (2009) Sliding window-based frequent pattern mining over data streams, Information Sciences 179, pp 3843-3865.
[13] Syed Khairuzzaman Tanbeer, Chowdhury Farhan Ahmed, Byeong-Soo Jeong, Young-Koo Lee, (2008) Efficient frequent pattern mining over data streams, ACM 978-1-59593-991, pp 1447-1448.
[14] Geoff Hulten, Laurie Spencer, and Pedro Domingos, (2001) Mining time changing data streams, ACM01, USA, pp 97-106.