A New Method For Mining Maximal Frequent Itemsets Based On Graph Theory
Abstract-Mining itemsets plays an important role in all fields of data mining research, such as association rules, clustering, and classification. Mining all frequent itemsets leads to a massive number of itemsets. This problem can be reduced by finding maximal frequent itemsets (MFI). In this paper, a new method for mining all MFI based on graph theory is proposed. In the presented method, a square matrix corresponding to the transaction elements of the database is first formed. Then the graph of this matrix is considered, and its maximal complete subgraphs (maximal cliques), which are in one-to-one correspondence with MFI, are found. Experimental results verify the advantages of the proposed method, including efficiency, simplicity, accuracy, and reasonable time and memory usage. Moreover, the presented method performs well on large databases.
Keywords-Data mining; Association rules; Frequent items; Maximal frequent itemsets; Maximal complete subgraph (Maximal clique); Graph theory
I. INTRODUCTION
The popularization of computers and improvements in database technology have led to more and more data being stored in large databases [1]. In this situation, mining useful information without effective methods is impossible [1]. A solution to this problem is data mining. One data mining technique is the discovery of association rules. The association rules problem is one of the important issues in data mining and has many applications, such as consumer market basket analysis, inferring patterns from web page access logs, and network intrusion detection [2]. The purpose of association rule mining is to mine the frequent itemsets quickly [3]. The process of discovering association rules is divided into two steps: the first step is finding the frequent itemsets, whose support is greater than a minimum support. The second step is producing rules from the frequent itemsets [1]. It should be noted that the key step is finding the frequent itemsets.
In 1993, R. Agrawal proposed an algorithm for association rules discovery called “Apriori” [4]. The main idea of this algorithm is identifying all frequent itemsets whose support is greater than the minimum support [1]. Notably, many of the conducted works are based on Agrawal’s approach. Some of these algorithms are as follows: Depth Project [5], Pincer-search [6], MAFIA (a maximal frequent itemset algorithm for transactional databases) [2], Max-Miner [7], and Dynamic Itemset Counting (DIC) [8]. These methods use top-down breadth-first search to mine all itemsets. In many applications with long frequent patterns (especially in the case of huge and dense data), mining all suitable itemsets is impossible [9]. In real applications, the number of frequent itemsets produced from a transaction database can be very large, and finding all frequent itemsets is impossible [2]. On the other hand, some applications do not need to find all itemsets. Therefore, since Maximal Frequent Itemsets (MFI) include all frequent itemsets [10], mining only the MFI can be considered. This field of research has attracted many researchers in recent decades, and many attempts at mining MFI have been made to date.
The rest of this paper is organized as follows: Section II is dedicated to the related work. Section III describes the proposed method. The experimental results are shown in Section IV. A comparison of our method with the literature is presented in Section V. Finally, Section VI contains the conclusion and future work.
II. RELATED WORK
There have been a number of attempts in the field of finding MFI. In [6], the Pincer-search algorithm uses a horizontal data format. This algorithm combines bottom-up and top-down techniques to find MFI. In this method, the bottom-up process identifies frequent and non-frequent itemsets. Then the frequent itemsets are corrected using the top-down process to obtain MFI. In [7], an extension of Apriori called the Max-Miner algorithm has been introduced. This algorithm performs subset infrequency pruning, so an itemset with an infrequent subset is not considered as a candidate itemset. Then, using superset frequency pruning, the search time is dramatically reduced. In [11] and [2], a set of MFI is mined and then non-maximal patterns are deleted using a post-pruning technique.
In [12], an efficient database encoding technique, a novel tree structure called PC_Tree, and the PC_Miner
algorithms have been introduced. The database encoding technique utilizes the characteristics of prime numbers and transforms each transaction into a positive integer that carries all properties of its items. Finally, the PC_Miner algorithm traverses the PC_Tree to mine MFI. In [9], a two-way hybrid algorithm has been proposed. In this method, mining proceeds top-down and bottom-up simultaneously. In addition, the information from the bottom-up process can be used to prune the search space of the top-down process. In [13], a hash-based method has been proposed. This method is a composition of the DHP (Direct Hashing and Pruning) and Pincer-search algorithms.
In [5], a method called SmartMiner has been presented. This method gathers and passes tail information, which is used by a heuristic function to select the next node. SmartMiner generates a smaller search tree, which requires fewer support counts and does not require superset checking. In [14], a new algorithm for mining the set of all MFI in landmark windows over data streams has been proposed. This algorithm is called DSM-MFI (which stands for Data Stream Mining for MFI). The essential information about the MFI embedded in the stream is maintained using a summary frequent itemset forest. In [15], a novel algorithm based on the frequent pattern list (FPL) and a bit string technique has been presented. According to the frequency of the frequent items, this algorithm conducts various operations on the FPL. Moreover, bit strings are utilized to test MFI.
In [3], an algorithm based on the frequent pattern graph has been introduced to find MFI. This technique uses breadth-first search and depth-first search to produce all MFI in the database. In [16], a data structure called the Frequent Pattern (FP) tree has been introduced. The FP-tree stores only essential information about frequent patterns. This work developed a mining algorithm for the FP-tree (called FP-growth). This algorithm scans the database only twice, and the mining information can be obtained from the FP-tree. [17] proposed an algorithm for mining frequent itemsets called PIETM (Principle of Inclusion–Exclusion and Transaction Mapping). PIETM has some advantages. First, it discovers frequent itemsets in a bottom-up manner similar to Apriori, but it reduces database scanning to only two passes. Second, instead of scanning the database to count the itemsets’ support, PIETM uses the Principle of Inclusion–Exclusion to calculate the support of candidate itemsets. Third, the transaction ids of each item are mapped and stored using transaction intervals, which facilitates the itemset counting process.
Four major problems seen in most of the conducted works are: high time complexity [7], high memory usage [4-6], no guarantee of finding all MFI [16], and a large number of scans of the transactions in the database [4]. Aiming to address these issues, a new method for mining MFI based on graph theory is proposed in this paper.
III. PROPOSED METHOD
As mentioned in the previous section, mining MFI suffers from problems such as high time complexity, high memory usage, no guarantee of finding all MFI, and a large number of scans of the database transactions. In order to address these problems, a new method based on graph theory is presented in this paper. The proposed method includes three steps: constructing a matrix corresponding to the transactions of the database, applying the minimum support condition, and drawing the graph of the matrix to find all MFI. These three steps are explained in the following subsections.
A. Constructing a matrix corresponding to the transactions of database
In the first step, for simplicity and for the matrix representation needed by the presented method, a unique integer is assigned to each element in the database. These integers start from one and increase consecutively. Then, corresponding to the transaction elements in the database, a square zero matrix is made. The size of this matrix is determined by the distinct elements in the database: if the number of distinct elements in the database is n, then the size of the corresponding matrix is n*n. So, n equals the maximum value of the elements in the database.
As an example, consider an input database whose maximum element value is five, as shown in Fig. 1(a). The corresponding zero matrix of size 5*5 is shown in Fig. 1(b). In this figure, Tid and Item are used instead of transaction id and type of product, respectively.
Tid   Item          0 0 0 0 0
1     1,3,4         0 0 0 0 0
2     2,3,5         0 0 0 0 0
3     1,2,3,5       0 0 0 0 0
4     2,5           0 0 0 0 0
     (a)               (b)
Figure 1. (a) is an example of an input database, and (b) is the zero matrix corresponding to (a). Since the maximum value of the elements in (a) is 5, the size of matrix (b) is 5*5
Then, for each pair of elements in each transaction, the matrix values at the corresponding row and column are increased by one. This process is done using the following formula:
\[
\begin{aligned}
&\mathrm{Matrix}\big(\mathrm{Database}(i,j),\ \mathrm{Database}(i,k)\big) = \mathrm{Matrix}\big(\mathrm{Database}(i,j),\ \mathrm{Database}(i,k)\big) + 1;\\
&\mathrm{Matrix}\big(\mathrm{Database}(i,k),\ \mathrm{Database}(i,j)\big) = \mathrm{Matrix}\big(\mathrm{Database}(i,k),\ \mathrm{Database}(i,j)\big) + 1;\\
&1 \le i \le x,\qquad 1 \le j \le y,\qquad j+1 \le k \le y.
\end{aligned}
\tag{1}
\]
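The matrix construction of Eq. (1) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' MATLAB implementation: the function name `cooccurrence_matrix` and the list-of-lists representation are assumptions made here, and items are assumed to be already numbered 1..n as described above. Applied to the database of Fig. 1(a), it yields the symmetric matrix shown in Fig. 2(a).

```python
from itertools import combinations


def cooccurrence_matrix(transactions):
    """Count, for every pair of distinct items, how many transactions contain both.

    Hypothetical helper illustrating Eq. (1); items are integers 1..n.
    """
    n = max(item for t in transactions for item in t)
    matrix = [[0] * n for _ in range(n)]
    for t in transactions:
        # Each unordered pair in the transaction updates both symmetric entries.
        for a, b in combinations(sorted(set(t)), 2):
            matrix[a - 1][b - 1] += 1
            matrix[b - 1][a - 1] += 1
    return matrix


# Database of Fig. 1(a).
database = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
matrix = cooccurrence_matrix(database)
for row in matrix:
    print(row)
# [0, 1, 2, 1, 1]
# [1, 0, 2, 0, 3]
# [2, 2, 0, 1, 2]
# [1, 0, 1, 0, 0]
# [1, 3, 2, 0, 0]
```

The database is scanned once, and the diagonal stays zero because only pairs of different items are counted.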
where x and y denote the number of rows and columns of the database, respectively. After applying (1), the zero matrix shown in Fig. 1(b) is completed as the matrix shown in Fig. 2(a). It should be noted that, based on (1) and as shown in Fig. 2(a), the matrix obtained from (1) is symmetric. Since we count the frequency of two different items, the diagonal values of the matrix are zero. Consequently, the diagonal values and the values under the main diagonal of the matrix are not important and are discarded in the presented method.
B. Implementation of minimum support condition
An itemset is called a frequent itemset if its support is more than or equal to some threshold value called the minimum support (min_sup) [14]. The minimum support is specified by the user and depends on the application.
0 1 2 1 1        0 1 2 1 1
1 0 2 0 3        0 0 2 0 3
2 2 0 1 2        0 0 0 1 2
1 0 1 0 0        0 0 0 0 0
1 3 2 0 0        0 0 0 0 0
   (a)              (b)
Figure 2. (a) is the completed matrix of what is shown in Fig. 1(b) using (1), and (b) specifies the part of (a) which is considered in the proposed algorithm
In this step of the proposed method, the minimum support condition is applied to the matrix obtained from the previous step. In order to find MFI, all non-frequent elements are deleted from the matrix. This is done using the following equation:
\[
\text{if } \mathrm{Matrix}(i,j) < \mathit{min\_sup} \text{ then } \mathrm{Matrix}(i,j) = 0, \qquad 1 \le i \le n,\quad i+1 \le j \le n. \tag{2}
\]
For example, if we suppose min_sup equals 2, then by applying the minimum support condition (2), a new matrix (called matrix N in this paper) is obtained, which is shown in Fig. 3.
0 2 0 0
  2 0 3
    0 2
      0
Figure 3. The obtained matrix N, after implementing the minimum condition on the matrix shown in Fig. 2(b)
C. Drawing matrix graph and finding all MFI
It should be noted that, in this step, drawing the graph only serves to make the process easier to understand and to show the main idea of the proposed algorithm (in practice, the graph does not need to be drawn). To do this, a vertex is considered for each row of the matrix. Then an edge is drawn between two vertices if their corresponding value in the matrix is nonzero. As an example, the graph of matrix N in Fig. 3 is shown in Fig. 4.
As mentioned above, a set of items is called a frequent itemset if the support degree of all items is greater than the minimum support. Equivalently, there is an edge between each two vertices in the graph, i.e. the vertices form a complete graph. Therefore, the maximal cliques of the matrix graph are in one-to-one correspondence with MFI. So, in order to mine all MFI, all maximal cliques must be found [18, 19].
Figure 4. Corresponding graph of the matrix N, shown in Fig. 3
Finding all maximal cliques is conducted as follows. First, we suppose the graph is complete, i.e. there is an edge between each two vertices. Therefore, the maximal clique includes all vertices in this step. As an example, in this step the maximal clique of Fig. 4 is considered as follows: <1,2,3,4,5>
In the second step, each row of the matrix N is traversed, and the row and column values corresponding to each zero value are separated in the considered clique, i.e. these two items must not appear together. For example, in the above example, the first zero value is at the 1st row and 2nd column. Therefore, values 1 and 2 must not appear together, and the maximal clique is fractured as follows:
< 1, 2, 3, 4, 5 > → < 1, 3, 4, 5 >
                    < 2, 3, 4, 5 >
During the row traversal of matrix N, the second zero value is seen at the 1st row and 4th column. So, values 1 and 4 are separated and we consider the following cliques:
< 1, 3, 4, 5 > → < 1, 3, 5 >
                 < 3, 4, 5 >
< 2, 3, 4, 5 >
The process is continued as follows. For the zero value at the 1st row and 5th column, the following cliques are considered:
< 1, 3, 5 > → < 1, 3 >
              < 3, 5 >
< 3, 4, 5 >
< 2, 3, 4, 5 >
For the zero value in the 2nd row and 4th column, the process is conducted as follows:
< 1, 3 >
< 3, 5 >
< 3, 4, 5 >
< 2, 3, 4, 5 > → < 2, 3, 5 >
                 < 3, 4, 5 >
It should be noted that the subgraph shown in yellow color has been deleted because of duplication. For the zero value in the 3rd row and 4th column, we have the following cliques:
< 1, 3 >
< 3, 5 >
< 3, 4, 5 > → < 3, 5 >
              < 4, 5 >
< 2, 3, 5 >
The last zero value is in the 4th row and 5th column, so subgraph <4, 5> is discarded. In this step, all subgraphs which are subsets of a complete graph are deleted (as shown in green color). Finally, only the maximal cliques remain, which are shown in pink color.
< 1, 3 >
< 3, 5 >
< 4, 5 >
< 2, 3, 5 >
To make the proposed method easier to follow, its steps have been applied to another example, shown in Fig. 5.
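The minimum support condition (2) and the fracturing walk-through on the Fig. 1 database can be sketched together as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the names `apply_min_sup` and `maximal_cliques_by_fracturing` are hypothetical, candidate cliques are held as Python `frozenset` objects so that duplicates collapse automatically, and subset removal is done once at the end instead of after every step. Single vertices are dropped, mirroring the way <4, 5> is discarded in the walk-through.

```python
def apply_min_sup(matrix, min_sup):
    """Eq. (2): set every entry below min_sup to zero."""
    return [[v if v >= min_sup else 0 for v in row] for row in matrix]


def maximal_cliques_by_fracturing(n_matrix):
    """Fracture the full vertex set at every zero above the main diagonal."""
    n = len(n_matrix)
    cliques = {frozenset(range(1, n + 1))}  # start from the complete graph
    for i in range(n):
        for j in range(i + 1, n):
            if n_matrix[i][j] != 0:
                continue
            a, b = i + 1, j + 1  # items a and b must not appear together
            fractured = set()
            for c in cliques:
                if a in c and b in c:
                    fractured.add(c - {a})
                    fractured.add(c - {b})
                else:
                    fractured.add(c)
            cliques = fractured
    # Keep sets that are not proper subsets of another set; drop single vertices.
    maximal = [c for c in cliques
               if len(c) > 1 and not any(c < d for d in cliques)]
    return sorted(sorted(c) for c in maximal)


# Matrix of Fig. 2(a), thresholded with min_sup = 2 to obtain matrix N.
fig2a = [
    [0, 1, 2, 1, 1],
    [1, 0, 2, 0, 3],
    [2, 2, 0, 1, 2],
    [1, 0, 1, 0, 0],
    [1, 3, 2, 0, 0],
]
matrix_n = apply_min_sup(fig2a, 2)
cliques = maximal_cliques_by_fracturing(matrix_n)
print(cliques)  # [[1, 3], [2, 3, 5]]
```

Because every clique of the graph stays contained in at least one candidate set after each fracture, the sets that survive the final subset filter are the maximal cliques. The number of candidate sets can grow quickly on sparse graphs, so this sketch is meant for small examples only.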
Tid   Item
1     1,2,5,6
2     3,4,5
3     5,6,7
4     2,3,6
5     1,3,4,7
6     2,6,7
7     1,2,6,7
8     3,4,5
9     2,3
10    1,6,7
(a)
0 2 1 1 1 3 3        0 2 0 0 0 3 3
2 0 2 1 1 4 2        0 0 2 0 0 4 2
1 2 0 3 2 1 1        0 0 0 3 2 0 0
1 1 3 0 2 0 1        0 0 0 0 2 0 0
1 1 2 2 0 2 1        0 0 0 0 0 2 0
3 4 1 0 2 0 4        0 0 0 0 0 0 4
3 2 1 1 1 4 0        0 0 0 0 0 0 0
     (b)                  (c)
(d) Graph of matrix (c) on vertices 1 to 7
(e) Fracturing steps, starting from <1,2,3,4,5,6,7>
Figure 5. (a) is an input database, (b) and (c) are the corresponding matrices to (a) with min_sup equal to 2. (d) is the graph of matrix (c), and (e) shows the steps of the proposed algorithm to find the maximal cliques. As shown in (e), the cliques marked in blue color have been fractured at the position given by the values in the blue rectangle. Moreover, cliques shown in green color, which are subsets of existing cliques, are deleted. The final maximal cliques are shown in pink color
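Since the matrices in Fig. 5(b) and (c) record only pairwise co-occurrence counts, it can be useful to count the support of each resulting clique directly against the transactions of Fig. 5(a). The following is a small sketch of such a check. The helper name `support` is an assumption made here, and the list `cliques` holds the maximal cliques of the graph of matrix (c) as computed for this sketch, not values read off the figure.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))


# Database of Fig. 5(a).
database = [
    [1, 2, 5, 6], [3, 4, 5], [5, 6, 7], [2, 3, 6], [1, 3, 4, 7],
    [2, 6, 7], [1, 2, 6, 7], [3, 4, 5], [2, 3], [1, 6, 7],
]

# Maximal cliques of the graph of matrix (c), computed for this sketch.
cliques = [[1, 2, 6, 7], [2, 3], [3, 4, 5], [5, 6]]
min_sup = 2

for clique in cliques:
    s = support(clique, database)
    print(clique, "support =", s, "meets min_sup" if s >= min_sup else "below min_sup")
```

A clique guarantees only that every pair of its items meets min_sup, so a clique whose direct support falls below min_sup would indicate an itemset for which the pairwise counts alone were not sufficient.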
IV. EXPERIMENTAL RESULTS
In this section, we evaluate the performance of the proposed method. Two large and well-known databases, Connect and Chess, have been chosen for the experiments. All experiments have been performed on a PC with a 2.1 GHz CPU and 2 GB of RAM running Microsoft Windows XP. All the algorithms have been implemented using MATLAB 7.11.0 (R2010b).
The first experiment has been conducted on the Connect database [20]. This database includes 67,557 transactions with 43 items in each transaction and has 129 unique items. The obtained results are shown in Fig. 6(a). This figure shows the run time in terms of the minimum support. The second experiment has been conducted on the Chess database [20]. This database includes 3,196 transactions with 37 items in each transaction and has 75 unique items. The obtained experimental results are shown in Fig. 6(b).
V. COMPARISON OF THE PROPOSED METHOD WITH LITERATURE
As mentioned in Section III, our method has been presented aiming to address four major issues in mining MFI: high time complexity, high memory usage, no guarantee of finding all MFI, and a large number of scans of the database transactions.
compare the time complexity. On the other hand, all of the presented algorithms have been implemented in C++, whereas our method has been implemented in MATLAB 7.11.0 (R2010b). MATLAB code is generally slow compared to C++. This property and the comparison shown in Fig. 6 indicate that our method is faster than the other conducted works.
Memory: To the best of our knowledge, many conducted works [2, 4-16] used tree methods which
method has good performance in the case of large databases, especially databases with a low number of unique items compared to the total number of transactions.
In the future, we will improve the presented method by changing the structure shown in Fig. 5. In the proposed method, the cliques are fractured for each zero in matrix N, which requires searching for all cliques related to that zero. We will instead consider a tree structure in place of the structure shown in Fig. 5, so that only the sub-tree containing the cliques that have to be fractured is searched rather than the whole structure. Therefore, the number of cliques that must be examined decreases considerably.
REFERENCES
[1] H. K. Jnanamurthy, H. V. Vishesh, J. Vishruth, P. Kumar, and R. M. Pai, “Discovery of maximal frequent item sets using subset creation,” Int. J. of data mining & knowledge management, vol. 3(1), pp. 27-38, 2013.
[2] D. Burdick, M. Calimlim, and J. Gehrke, “Mafia: A maximal frequent itemset algorithm for transactional databases,” in Proc. of the 17th Int. Conf. on data engineering, pp. 443-452, 2001.
[3] B. Liu and J. Pan, “A graph based algorithm for mining maximal frequent itemsets,” in Proc. 4th Int. Conf. on fuzzy systems and knowledge discovery, pp. 263-267, Haikou, 2007.
[4] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases,” in Proc. of ACM SIGMOD Conf. on Management of Data, pp. 207–216, 1993.
[5] Q. Zou, W. W. Chu, and B. Lu, “SmartMiner: a depth first algorithm guided by tail information for mining maximal frequent itemsets,” in Proc. Int. Conf. on data mining, pp. 570-577, 2002.
[6] D. I. Lin and Z. M. Kedem, “Pincer-search: an efficient algorithm for discovering the maximal frequent set,” IEEE Transactions on knowledge and data engineering, vol. 14(3), pp. 553-566, 1998.
[7] R. J. Bayardo, “Efficiently mining long patterns from databases,” in Proc. of the ACM SIGMOD Int. Conf. on Management of data, pp. 85-93, 1998.
[8] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, “Dynamic itemset counting and implication rules for market basket data,” in Proc. of ACM SIGMOD Int. Conf. on Management of Data, Tucson, AZ, pp. 255-264, 1997.
[9] F. Z. Chen and M. Q. Li, “A two-way hybrid algorithm for maximal frequent itemsets mining,” in Proc. 4th Int. Conf. on fuzzy systems and knowledge discovery, pp. 24-27, Haikou, 2007.
[10] H. Yuan and J. Wu, “Mining maximal frequent patterns with similarity matrices of data records,” in Proc. 1st Int. Conf. on E-Business Intelligence, Atlantis Press, 2010.
[11] R. Agarwal, C. C. Aggarwal, and V. V. V. Prasad, “A tree projection algorithm for generation of frequent itemsets,” J. of parallel and distributed computing (special issue on high performance data mining), vol. 61(3), pp. 350-371, 2001.
[12] M. Nadimi-Shahraki, N. Mustapha, M. N. B. Sulaiman, and A. B. Mamat, “A new method for mining maximal frequent itemsets,” Int. symposium on information technology, Malaysia, Kuala Lumpur, pp. 1-4, 2008.
[13] D. L. Yang, C. T. Pan, and Y. C. Chung, “An efficient hash-based method for discovering the maximal frequent set,” 25th annual Int. computer software and applications conference, pp. 511-516, Chicago, 2001.
[14] H. F. Li, S. Y. Lee, and M. K. Shan, “Online mining (recently) maximal frequent itemsets over data streams,” 15th Int. workshop on research issues in data engineering: stream data mining and applications, pp. 11-18, 2005.
[15] J. Qian and F. Ye, “Mining maximal frequent itemsets with frequent pattern list,” in Proc. 4th Int. Conf. on fuzzy systems and knowledge discovery, pp. 628-632, Haikou, 2007.
[16] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining frequent patterns without candidate generation: A frequent-pattern tree approach,” J. of data mining and knowledge discovery, vol. 8(1), pp. 53-87, 2004.
[17] K-C. Lin, I-E. Liao, T-P. Chang, and S-F. Lin, “A frequent itemset mining algorithm based on the Principle of Inclusion–Exclusion and transaction mapping,” J. of information sciences, vol. 276, pp. 278-289, 2014.
[18] J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent pattern mining: current status and future directions,” J. of data mining and knowledge discovery, vol. 5(1), pp. 55-86, 2007.
[19] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic subspace clustering of high dimensional data for data mining applications,” in Proc. Int. Conf. on Management of data, vol. 27(2), pp. 94-105, 1998.
[20] http://fimi.ua.ac.be/data/