Y JAYA BABU* et al.
[IJESAT] INTERNATIONAL JOURNAL OF ENGINEERING SCIENCE & ADVANCED TECHNOLOGY
ISSN: 2250-3676
Volume - 2, Issue - 1, 79-84
EXTRACTING SPATIAL ASSOCIATION RULES FROM THE MAXIMUM FREQUENT ITEMSETS BASED ON BOOLEAN MATRIX
Prof. Y Jaya Babu 1, G J Phani Bala 2, Siva Rama Krishna T 3
1 Prof. & Head, Dept. of MCA, Pragati Engg. College, Andhra Pradesh, India, [email protected]
2 Asst. Prof., Dept. of IT, Pragati Engg. College, Andhra Pradesh, India, [email protected]
3 Asst. Prof., Dept. of CSE, Vishnu Inst. of Technology, Andhra Pradesh, India, [email protected]
Abstract
Mining spatial association rules is one of the most important branches of Spatial Data Mining (SDM). Because spatial data are complex, the traditional way to extract spatial association rules is to transform the spatial database into a general transaction database. The Apriori algorithm is currently one of the most commonly used methods for mining association rules, but it performs inefficiently on large databases. This paper proposes a new algorithm that extracts maximum frequent itemsets based on a Boolean matrix. A case study on extracting the spatial association rules between land cover and terrain factors is presented to validate the new algorithm. Finally, a comparison between the Apriori algorithm and the new one shows that the new algorithm improves the efficiency of extracting spatial association rules.
Index Terms: Maximum frequent itemset; Spatial association rule; Apriori algorithm
1. INTRODUCTION
Spatial Data Mining (SDM) is a spatial decision support process that aims at extracting implicit, previously unknown and potentially useful spatial and non-spatial knowledge from spatial data, including general geometry rules, spatial characteristic rules, spatial classification rules, spatial clustering rules, spatial association rules and so on [1]. A spatial association rule, also termed a spatial association location pattern [2], is a rule indicating certain association relationships among a set of spatial and non-spatial attributes of geographical objects, and mining such rules is one of the most important branches of SDM. Because spatial data are complex, the main idea of extracting spatial association rules is to categorize the spatial data into a transaction database and then mine that database with an association rule mining algorithm. The Apriori algorithm [3] is currently one of the most commonly used algorithms for mining association rules; its typical application is market basket analysis, which discovers customer shopping patterns [4]. Subsequently, the algorithm was extended to SDM to discover multi-level spatial association rules based on progressive refinement [5]. A shortcoming of the algorithm, however, is that it performs inefficiently on large databases, and the deficiency is more obvious for large amounts of spatial data. Although meta-rules can reduce the number of unnecessary itemsets that are computed, the meta-rules were pre-designed and users accepted them passively. Therefore, two models were proposed to learn prior knowledge from users' interactive feedback [10]. In this paper, a new algorithm is proposed. It first extracts the maximum frequent itemsets based on the Boolean matrix of the frequent length-1 itemsets, which are generated using the Apriori algorithm, and then generates all the frequent itemsets from the maximum frequent itemsets, using the property that every nonempty subset of a frequent itemset is also frequent. Finally, the Apriori algorithm and the proposed one are compared by mining the spatial association rules between terrain factors and land cover, to validate the new algorithm's efficiency.
2. THE COMPARISON OF THE PRINCIPLES OF THE ALGORITHMS
The Apriori algorithm, proposed by R. Agrawal et al. in 1994, is one of the most influential algorithms for mining association rules. According to the principles described in [3], it consists of two steps: the first extracts all the frequent itemsets, and the second generates all the strong association rules from the frequent itemsets [6]. In essence, it iteratively generates the set of candidate itemsets of length (k+1) from the frequent itemsets of length k and checks their occurrence frequencies in the database to obtain the frequent itemsets of length (k+1) at each level. Two main reasons for the low efficiency of the Apriori algorithm follow from this: a large number of candidate itemsets must be generated to obtain the frequent itemsets of each length, and the database must be scanned many times.
IJESAT | Jan-Feb 2012
Available online @ http://www.ijesat.org 79
This research therefore presents a new algorithm that first mines the maximum frequent itemsets based on the Boolean matrix of the frequent length-1 itemsets. The main idea is to create a Boolean matrix with the frequent length-1 itemsets as row headings and the transaction record IDs as column headings (Table 1). The matrix holds only two values, 1 and 0, which indicate that the transaction record does or does not contain the corresponding frequent length-1 itemset. Next, the number of 1s in each column is calculated, together with the count of columns that share the same number of 1s. If the count of those columns is larger than the minimum support, that number of 1s may be the size of a maximum frequent itemset; otherwise it is not. In this way, a set of values is obtained, each of which may be the length of the maximum frequent itemsets. Subsequently, for each such value, a set of candidate itemsets for extracting the maximum frequent itemsets is generated from the frequent length-1 itemsets, and the support of each candidate itemset is calculated from the Boolean matrix. If the support is larger than the minimum support, the candidate itemset is frequent; otherwise it is not. Finally, all the frequent itemsets are extracted from the maximum frequent itemsets, using the property that every nonempty subset of a frequent itemset is also frequent. Generally speaking, the main principles of the new algorithm include three aspects:
2.1 Creating a Boolean Matrix According to Frequent Length-1 Itemsets
All the frequent length-1 itemsets are generated from the transaction database using the Apriori algorithm when the transaction database is scanned for the first time, and for each frequent length-1 itemset, the IDs of all the transaction records containing it are recorded in an array. Then the corresponding Boolean array, whose length is the number of transaction records in the database, is created for each frequent length-1 itemset. Each array holds only two values, 0 and 1: if a transaction record contains the frequent length-1 itemset, the corresponding value in the Boolean array is 1; otherwise it is 0. At last, a Boolean matrix is constructed from the Boolean arrays of all the frequent length-1 itemsets.
1) Definition 1: The Boolean array of each frequent length-1 itemset is $I_m[N] = \{B_{T_1}, B_{T_2}, \ldots, B_{T_n}\}$ $(1 \le n \le N)$, where $I_m$ is the $m$th frequent length-1 itemset; $N$ is the number of transaction records in the database; $T_n$ is the ID of the $n$th transaction record; and the value of $B_{T_n}$ is either 0 or 1.
2) Definition 2: The Boolean matrix of the frequent length-1 itemsets is $I_{M \times N} = \{I_1[N], I_2[N], \ldots, I_m[N]\}$ $(1 \le m \le M)$, where $I_m[N]$ is the $N$-dimensional Boolean array of the $m$th frequent length-1 itemset and $M$ is the number of frequent length-1 itemsets.
3) The pseudocode of the first part (Fig. 1):
Figure 1: The pseudocode of the first part
Table 1: A Part of the Boolean Matrix of the Frequent Length-1 Itemsets
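The construction in Section 2.1 can be illustrated with a minimal Python sketch. It is not the authors' program (which the paper says was written in C#); the function name `build_boolean_matrix`, the toy transaction records and the choice to express the minimum support as an absolute record count are assumptions made only for illustration. The comparison uses "larger than", following the wording of the text.

```python
from collections import Counter


def build_boolean_matrix(transactions, min_support):
    """Return (items, matrix) for the frequent length-1 itemsets.

    transactions: list of sets of items, one set per transaction record.
    min_support: minimum support as an absolute record count (assumption).
    """
    # One scan of the database counts how often each item occurs.
    counts = Counter(item for record in transactions for item in record)
    # Keep the items whose support is larger than the minimum support.
    items = sorted(item for item, count in counts.items() if count > min_support)
    # Row m is the Boolean array I_m[N]; column n is transaction record T_n.
    matrix = [[1 if item in record else 0 for record in transactions]
              for item in items]
    return items, matrix


# Hypothetical toy records, not data from the case study.
transactions = [
    {"forest", "steep", "north"},
    {"forest", "steep"},
    {"farmland", "plain"},
    {"forest", "steep", "north"},
]
items, matrix = build_boolean_matrix(transactions, min_support=1)
for item, row in zip(items, matrix):
    print(item, row)
```

Each row of `matrix` plays the role of one Boolean array from Definition 1, and the list of rows corresponds to the matrix of Definition 2.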
2.2 Extracting Maximum Frequent Itemsets from Boolean Matrix
Each column in the Boolean matrix represents one transaction record. A value of 1 in a column means that the corresponding transaction record contains the corresponding frequent length-1 itemset, and a value of 0 means that it does not. Therefore, the number of 1s in each column is the number of frequent length-1 itemsets that the corresponding transaction record contains together. If the number of transaction records with the same number of 1s is larger than the minimum support, that number of 1s may be the size of a maximum frequent itemset; otherwise it is not. As a result, a set of values is obtained, each of which may be the length of the maximum frequent itemsets. Then, for each of these values in descending order, a series of candidate itemsets is generated from the frequent length-1 itemsets, and the support of each candidate itemset is calculated from the Boolean matrix of the frequent length-1 itemsets. If the support of a candidate itemset is larger than the minimum support, the candidate itemset is frequent; otherwise it is not. If the set of maximum frequent itemsets generated from the candidate itemsets is not empty, the size of the candidate itemsets is the required length of the maximum frequent itemsets. Otherwise, the operation is repeated for the next value until the set of maximum frequent itemsets is not empty. If it is empty for every value, the maximum length of the frequent itemsets is one.
1) Definition 3: Max[n] is an array that stores the values each of which may be the length of a maximum frequent itemset, where n is the size of Max[n].
2) Definition 4: The set of candidate itemsets of the maximum frequent itemsets is $C = \{I_{M_1}, I_{M_2}, \ldots, I_{M_n}\}$; therefore, the corresponding Boolean matrix is $CM_{n \times N} = \{I_{M_1}[N], I_{M_2}[N], \ldots, I_{M_n}[N]\}$, where $I_{M_n}$ is a candidate itemset.
3) Definition 5: The support of candidate itemset $C$ is $\mathrm{Support}(C) = I_{M_1}[N] \ \mathrm{And}\ I_{M_2}[N] \ \mathrm{And}\ \ldots \ \mathrm{And}\ I_{M_n}[N]$, where And is the logical Boolean operator: if any operand is 0, the result is 0. Fig. 3 shows an example of the And operator applied to the Boolean arrays of candidate itemsets.
4) The pseudocode of the second part (Fig. 2):
Figure 2: The pseudocode of the second part
Figure 3: The logical Boolean operator of the Boolean arrays of the set of candidate itemsets
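The second part can be sketched in Python as follows. This is a sketch that follows the description in Section 2.2 literally, not a reproduction of the pseudocode in Fig. 2. The function names and the toy matrix are hypothetical, and two points are assumptions: the support of a candidate is taken to be the number of 1s left after And-ing its Boolean arrays, and lengths of 1 are skipped because they are the fallback case.

```python
from collections import Counter
from itertools import combinations


def support(rows):
    """Number of records whose column is 1 in every row (And of the rows)."""
    return sum(all(column) for column in zip(*rows))


def candidate_lengths(matrix, min_support):
    """Values that may be the length of a maximum frequent itemset (Max[n])."""
    # Count how many columns (records) share the same number of 1s.
    ones_per_column = Counter(sum(column) for column in zip(*matrix))
    return sorted((k for k, count in ones_per_column.items()
                   if k > 1 and count > min_support), reverse=True)


def maximum_frequent_itemsets(items, matrix, min_support):
    """Try each candidate length in descending order; stop at the first hit."""
    for k in candidate_lengths(matrix, min_support):
        frequent = [tuple(items[i] for i in idx)
                    for idx in combinations(range(len(items)), k)
                    if support([matrix[i] for i in idx]) > min_support]
        if frequent:
            return frequent
    # No longer itemset is frequent: the maximum length is one.
    return [(item,) for item in items]


# Hypothetical Boolean matrix of three frequent length-1 itemsets.
items = ["forest", "north", "steep"]
matrix = [[1, 1, 0, 1],
          [1, 0, 0, 1],
          [1, 1, 0, 1]]
pair_support = support([matrix[0], matrix[1]])
maximal = maximum_frequent_itemsets(items, matrix, min_support=1)
print(pair_support, maximal)
```

Once the matrix is built, every support value is obtained from the Boolean arrays alone, so the transaction database does not need to be scanned again.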
2.3 Generating All the Frequent Itemsets from Maximum Frequent Itemsets
All the frequent itemsets can be extracted from the maximum frequent itemsets, using the property that every nonempty subset of a frequent itemset is also frequent. The support of each frequent itemset can be calculated by Definition 5. At last, all the strong association rules can be mined from all the frequent itemsets.
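A short Python sketch of this last step is given below. The function names, the toy Boolean arrays and the use of "greater than or equal to" for the minimum confidence are assumptions for illustration; the paper does not give the rule generation procedure in detail, so the standard confidence definition, support(itemset) / support(antecedent), is used here.

```python
from itertools import combinations


def all_frequent_itemsets(maximum_itemsets):
    """Every nonempty subset of a maximum frequent itemset is frequent."""
    frequent = set()
    for itemset in maximum_itemsets:
        for k in range(1, len(itemset) + 1):
            frequent.update(combinations(sorted(itemset), k))
    return frequent


def strong_rules(frequent, support_of, min_confidence):
    """Return (antecedent, consequent, confidence) for each strong rule."""
    rules = []
    for itemset in frequent:
        for k in range(1, len(itemset)):
            for antecedent in combinations(itemset, k):
                consequent = tuple(i for i in itemset if i not in antecedent)
                confidence = support_of(itemset) / support_of(antecedent)
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules


# Hypothetical Boolean arrays of three frequent length-1 itemsets.
rows = {"forest": [1, 1, 0, 1], "north": [1, 0, 0, 1], "steep": [1, 1, 0, 1]}


def support_of(itemset):
    """Support by And-ing the Boolean arrays, in the sense of Definition 5."""
    return sum(all(column) for column in zip(*(rows[i] for i in itemset)))


frequent = all_frequent_itemsets([("forest", "north", "steep")])
rules = strong_rules(frequent, support_of, min_confidence=0.9)
for rule in sorted(rules):
    print(rule)
```

Because the subsets are enumerated directly from the maximum frequent itemsets, no further candidate generation is needed in this step.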
3. IMPLEMENTATION AND COMPARISON OF THE ALGORITHMS
In this part, a case study on extracting the spatial association rules between land cover and terrain factors is presented to validate the proposed algorithm's efficiency. The slope and the aspect derived from a DEM with a grid cell size of 100 m and the land cover map extracted from a SPOT-5 remote sensing image were taken as the experimental datasets. The Apriori algorithm and the new one were used to mine the spatial association rules from these datasets, and the efficiency of the two algorithms was then compared and analyzed.
3.1 Spatial Data Preprocessing
According to the current main idea of mining spatial association rules, spatial datasets need to be preprocessed to construct the transaction database before mining. Imam Mukhlash and Benhard Sitohang put forward a framework of spatial data preprocessing, which includes feature (spatial and non-spatial) selection based on spatial parameters, dimension reduction and selection of non-spatial attributes, data categorization based on non-spatial data parameters, join operations for spatial objects based on spatial parameters, and transformation into the output form [16]. Therefore, all the spatial datasets in the case study need to be preprocessed in the following three aspects:
1) The preparation and preprocessing of spatial datasets: The spatial datasets in the case study included the elevation, the slope and the aspect with a spatial resolution of 100 m, and the land cover map. The slope and the aspect were derived from the elevation, and the land cover map was derived from the SPOT-5 remote sensing image. Fig. 4 shows the flow chart of the spatial data preprocessing. At last, all the spatial datasets were masked by the study region boundary layer to ensure the same spatial extent for each spatial dataset.
Figure 4: The flow chart of the spatial data preprocessing
2) The categorization of the attribute values for each spatial dataset: According to the spatial data preprocessing framework, the attribute values of each spatial dataset must be generalized. Therefore, the elevation was categorized into 5 types according to [17]: Extremely High Mountain (>5000m), High Mountain (3500~5000m), Middle Mountain (1000~3000m), Low Mountain (500~1000m), and Plain and Hill (<500m). The slope was generalized into 4 types based on the slope steepness classification of the International Geographical Union Geomorphological Survey and Mapping Council: plain (<2°), slope (2°~6°), abrupt slope (6°~25°) and steep slope (>25°). Fig. 5 shows the compass direction of the aspect. The land cover types included river, estuarine, reservoir, built-up land, farmland, gardens, forest land, mangrove, grass land, and so on.
Figure 5: The compass direction of the aspect
3) The construction of the transaction database: After the spatial data preprocessing was completed, each grid cell was treated as one transaction record with 5 parts: the TID (the grid cell's ID), the slope category, the aspect category, the elevation category and the land cover type. The attribute values of each grid cell then had to be read quickly from all the raster datasets and classified according to the categories of the spatial datasets to construct the attribute transaction table. At last, the Apriori algorithm and the new one were applied to extract the frequent itemsets from the constructed transaction database. The program for all the above tasks was implemented in the C# programming language (Fig. 6).
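The categorization and the record construction described in the three aspects above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' C# program: the function names are hypothetical, slope values are assumed to be in degrees, the assignment of the boundary values 2, 6 and 25 to a class is a choice made here, the grid cell ID is assumed to be row-major, and the aspect, elevation and land cover grids are assumed to be categorized already (the aspect labels are placeholders, since Fig. 5 is not reproduced here).

```python
def slope_category(degrees):
    """Map a slope in degrees to one of the four slope classes in the text."""
    if degrees < 2:
        return "plain"
    if degrees <= 6:
        return "slope"
    if degrees <= 25:
        return "abrupt slope"
    return "steep slope"


def build_transactions(slope, aspect, elevation, land_cover):
    """Turn equally sized grids (lists of rows) into transaction records."""
    records = []
    for r, slope_row in enumerate(slope):
        for c, degrees in enumerate(slope_row):
            tid = r * len(slope_row) + c  # row-major grid cell ID (assumption)
            records.append((tid, slope_category(degrees), aspect[r][c],
                            elevation[r][c], land_cover[r][c]))
    return records


# Hypothetical 2 x 2 grids, not data from the case study.
slope = [[1.0, 4.5], [12.0, 30.0]]
aspect = [["N", "E"], ["S", "W"]]
elevation = [["Plain and Hill", "Plain and Hill"],
             ["Low Mountain", "Low Mountain"]]
land_cover = [["farmland", "farmland"], ["forest land", "forest land"]]

records = build_transactions(slope, aspect, elevation, land_cover)
for record in records:
    print(record)
```

Each tuple corresponds to one transaction record with the 5 parts listed in the text: TID, slope category, aspect category, elevation category and land cover type.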
Figure 6: The procedure of extracting the spatial association rules
3.2 Comparison of the Algorithms' Efficiencies
After the transaction database with 157155 transaction records was constructed, the procedure shown in Fig. 6 was run on a computer with a Pentium(R) Dual-Core 2.60GHz CPU and 2GB of memory to extract all the frequent itemsets with minimum supports of 100, 200, 400, 800, 1600, 3200, 6400, 12800 and 25600, and the spatial association rules with a minimum support of 2% and a minimum confidence of 30%. The output of the procedure is shown in Fig. 7. The runtime of the new algorithm is less than that of the Apriori algorithm for every minimum support. As the minimum support grew smaller, the runtimes of both algorithms increased continuously, but the runtime of the new algorithm grew much more slowly than that of the Apriori algorithm. According to the comparison of the principles of the two algorithms, the new algorithm not only reduced the number of scans of the transaction database, but also decreased the number of candidate itemsets. Therefore, the algorithm proposed in this paper is superior to the Apriori algorithm.
Figure 7: The comparison between the proposed algorithm and the Apriori algorithm
4. CONCLUSIONS
The paper presented a new algorithm for mining spatial association rules that first extracts the maximum frequent itemsets based on a Boolean matrix. The algorithm not only reduced the number of scans of the transaction database, but also decreased the number of candidate itemsets. However, some problems with the algorithm still need further consideration. First, the Boolean matrix of the frequent length-1 itemsets may contain many successive 0 values, which wastes memory to some extent; although compressing the matrix can solve this problem, uncompressing it may lower the efficiency of the algorithm. Second, the algorithm lacks an evaluation of the quality of the frequent itemsets, in particular of how to interpret and understand their significance. Third, the autocorrelation between spatial objects is not considered in the new algorithm. These three aspects will be emphasized in future work.
REFERENCES
[1] D. Li, S. Wang, and D. Li, Spatial Data Mining Theories and Applications, Beijing: Publisher of Science, 2006, pp. 32-36.
[2] R. Ma, Y. Pu, and X. Ma, Mining Spatial Association Patterns from GIS Database, Beijing: Publisher of Science, 2007, pp. 68-69.
[3] R. Agrawal, T. Imielinski, and A. Swami, Mining Association Rules Between Sets of Items in Large Database, Proc. ACM-SIGMOD International Conference, pp. 208-216, 1993.
[4] J. Han and M. Kamber, Data Mining Concepts and Techniques, Beijing: China Machine Press, 2007, pp. 149-152.
[5] K. Koperski and J. Han, Discovery of spatial association rules in geographic information databases, Lecture Notes In Computer Science, vol. 951, 1995, pp. 47-66.
[6] A. Salleb and C. Vrain, An application of association rules discovery to geographic information systems, Proc. The 4th European Conference on Principles of Data Mining and Knowledge Discovery PKDD, pp. 613-618, 2000.
[7] G. Chen, Z. He, and B. Yang, Spatial Association Rules Data Mining Research on Terrain Feature and Mountain Climate Change, Geography and Geo-Information Science, vol. 26(1), 2010, pp. 37-40.
[8] Y. Fu and J. Han, Meta-Rule-Guided Mining of Association Rules in Relational Databases, Proc. Int'l Workshop on Integration of Knowledge Discovery with Deductive and Object-Oriented Databases, pp. 39-46, 1995.
[9] C. Yuan and F. Xiong, Meta-rule-guided Mining Multiple-level Spatial Association Rules Based on Progressive Refinement, Computer Engineering, vol. 30(8), 2004, pp. 34-36.
[10] D. Xin, X. Shen, Q. Mei, and J. Han, Discovering Interesting Patterns Through User's Interactive Feedback, Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD'06), Philadelphia, Pennsylvania, USA, August 20-23, 2006.
[11] J. Han, J. Pei, Y. Yin, and R. Mao, Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach, Data Mining and Knowledge Discovery, 2004, pp. 53-87.
[12] R. Ma, X. Ma, and Y. Pu, Spatial Association Rule Mining from GIS Database, Journal of Remote Sensing, vol. 9(6), 2005, pp. 733-741.
[13] L. Wang, K. Xie, T. Chen, and X. Ma, Efficient discovery of multilevel spatial association rules using partitions, Information and Software Technology, vol. 47, 2005, pp. 829-840.
[14] A. J. T. Lee, R. Hong, W. Ko, W. Tsao, and H. Lin, Mining spatial association rules in image databases, Information Sciences, vol. 177, 2007, pp. 1593-1608.
[15] Y. Zhang, Research of Frequent Itemsets Mining Algorithm Based on 0-1 Matrix, Computer Engineering and Design, vol. 30(20), 2009, pp. 4662-4664.
[16] I. Mukhlash and B. Sitohang, Spatial Data Preprocessing for Mining Spatial Association Rule with Conventional Association Mining Algorithms, Proc. The International Conference on Electrical Engineering and Informatics, Institute Teknologi Bandung, Indonesia, pp. 531-534, June 17-19, 2007.
[17] Physical Regionalization Working Committee of Chinese Academy of Science, Geomorphological Regionalization of China, Beijing: Publisher of Science, 1959.
BIOGRAPHIES
Prof. Y Jaya Babu is currently heading the department of Computer Applications, Pragati Engineering College. He is a postgraduate in Computer Science and Technology and has 18 years of teaching and research experience. His research interests include spatial data mining, web mining and data warehousing.
Mrs. G J Phani Bala is an Assistant Professor in the department of Information Technology, Pragati Engineering College. She is a graduate in Computer Science and Engineering and has 5 years of teaching and research experience. Her research interests include data mining, 2D object rendering and image processing.
Mr. Siva Rama Krishna T is an Assistant Professor in the department of Computer Science and Engineering, Vishnu Institute of Technology. He is a postgraduate in Computer Networks and has 3 years of teaching and research experience. His research interests include data mining, cloud computing and security protocols.