An Improved Frequent Pattern Tree: The Child Structured Frequent Pattern Tree
https://doi.org/10.1007/s10044-022-01111-1
THEORETICAL ADVANCES
Received: 13 November 2020 / Accepted: 27 August 2022 / Published online: 26 September 2022
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2022
Abstract
Frequent itemsets are itemsets that occur frequently in a dataset. Frequent itemset mining extracts specific itemsets with
supports higher than or equal to a minimum support threshold. Many mining methods have been proposed but Apriori and
FP-growth are still regarded as two prominent algorithms. The performance of the frequent itemset mining depends on many
factors; one of them is searching the nodes while constructing the tree. This paper introduces a new prefix-tree structure
called the child structured frequent pattern tree (CSFP-tree): an FP-tree with a child search subtree attached to each node. The
experimental results reveal that the CSFP-tree is superior to the FP-tree and its recent variations for all kinds of datasets.
Keywords FP-tree · CSFP-tree · Frequent itemset mining · Data mining · CSFP-tree mining · Improved FP-tree
* O. Jamsheela
[email protected]
1 EMEA College of Arts and Science, Kondotty, Kerala, India
2 Christ University Yeshwantpur Campus, Bengaluru, India

1 Introduction

The frequent itemset mining algorithm demands an efficient data structure to store frequent itemsets for further processing. FP-growth uses a prefix tree to store the frequent itemsets and mines frequent itemsets without generating candidate itemsets. It achieves much better performance and efficiency than Apriori-like algorithms. To avoid the costly candidate generation, the FP-growth algorithm uses a frequent pattern tree (FP-tree) with a header table. The FP-growth algorithm scans the database twice. After the first scan, the frequent 1-itemsets are stored in the header table in decreasing order of their frequencies. The FP-tree is a tree-like data structure constructed during the second scan, after which the transactions of the transaction database are stored in the FP-tree in a compressed form. The first instance of each item in the FP-tree is linked with the corresponding item in the header table, and nodes of the FP-tree with similar items are connected by a link. In the FP-growth method, FP-tree construction is the first step. In the second phase, frequent itemsets are mined from the FP-tree. Mining starts from the least frequent item and proceeds to the most frequent item. Conditional FP-trees are constructed by using paths with the same prefix item. Using the conditional FP-trees, the algorithm can generate the frequent itemsets.

Most of the recent proposals based on the FP-tree concentrate on improving the mining phase, whereas improvement of the FP-tree structure itself is of great significance, as a better tree structure would reduce the runtime as well as the memory requirement. Hence, we explored the possibility of modifying the basic FP-tree structure.

Consequently, in this paper an improved tree structure called the child structured frequent pattern tree (CSFP-tree) is proposed. In the proposed algorithm, the child list of each node is replaced with a child search tree (CST) to improve the searching.

2 Related work

Frequent itemset mining finds specific itemsets with supports higher than or equal to a minimum support threshold. Many frequent itemset mining methods have been introduced by various authors, but Apriori and FP-growth are still regarded as the favored algorithms. Apriori is the oldest frequent itemset mining algorithm [45]. Many algorithms, such as DP-Apriori [9], AGM [23, 62], Parallel Apriori [60] and YAFIM [46], are based on the Apriori algorithm. Apriori first generates candidate itemsets, then scans the database to confirm whether the candidates are frequent or not. This
method scans the database as many times as the maximum length among the frequent itemsets. The Apriori property is used in many recent algorithms. The AprioriDP approach requires only one database scan for both the frequent candidate 1-itemsets and 2-itemsets [6]. The gpuDCI algorithm [52] is the parallelization of the DCI algorithm, a sequential algorithm for frequent itemset mining [38].

The parallel Apriori algorithm proposed by Bhalodiya et al. [6] is implemented in a parallel processing structure. The authors have used different nodes to run the algorithm. The database is partitioned into small sections and each partition is assigned to a different node. The result of each node has to be consolidated to get the final output, so each node sends its result to a central node. The authors suggested only a revised version of a modified Apriori algorithm [17]; they did not compare the algorithm with a faster mining algorithm such as FP-growth. Another algorithm [52] also proposed a parallel version to improve the frequent itemset mining process. In this paper, the authors maximize the utilization of the GPU to parallelize the bitmap of transactions. They have implemented transaction-wise parallelization and candidate-wise parallelization. The authors have compared the algorithm with the sequential DCI algorithm and proved that their algorithm is faster than DCI. Another improved version of parallel Apriori, proposed by Qiu et al. [46], is called YAFIM (Yet Another Frequent Itemset Mining). This is also a parallel Apriori algorithm, based on a specially designed in-memory parallel computing model (the Spark RDD framework) to support iterative algorithms and interactive data mining. The transaction database is loaded once into the Spark RDDs (Resilient Distributed Datasets), the memory-based data objects in Spark, and is reused during the following iterations. The authors compared the algorithm with MPApriori [33] and proved that it is 25 times faster than the previous one. Another mining algorithm was proposed to find correlated itemsets. The existing algorithms with a single minAllConf threshold, applied to databases of widely varying item frequencies, can cause a dilemma known as the rare item problem. To solve the problem, the authors proposed a generalized model of a pattern-growth algorithm, called GCoMine [29], to discover the itemsets. The approaches of the above-mentioned algorithms are different, and they have applied different methods to reduce the time complexity. Table 1 gives a summary of the prominent algorithms described above. In data mining, different kinds of trees, such as decision trees and the FP-tree, have been proposed, and improvements to these trees are also a major research area. Frequent itemset mining without any kind of tree has also been proposed and proved efficient [41, 49].

Table 1 Summary of the prominent algorithms

Algorithm | Authors | Data structure | Description | Based on | Compared with
AprioriDP | Bhalodiya et al. [6] | Count-table | Requires only one database scan for both frequent candidate 1-itemsets and 2-itemsets | Apriori | Apriori
gpuDCI | Silvestri and Orlando [52] | Bitmap | Parallel conversion of the DCI algorithm | Apriori | DCI
Parallel | Ye and Chiang [60] | Trie | Parallel implementation of a trie-based Apriori | Apriori | Trie-based Apriori
YAFIM | Qiu et al. [46] | Hash tree | Parallel Apriori algorithm based on the Spark RDD framework | Apriori | MRApriori
GCoMine | Rage and Kitsuregawa [48] | CP-tree | Used multiple minAllConf values to avoid the rare item problem | FP-tree | CoMine
CoMine++ | Kiran and Kitsuregawa [29] | CP-tree | Introduces items' support intervals to combine items | FP-tree | CoMine

The FP-growth algorithm solved the problem of unwanted scans with the use of the FP-tree, which consists of a tree for storing the transactions in a transaction database and a header table containing the frequent 1-itemsets sorted in descending order of frequency. Each node of the FP-tree contains an item name, a support count, a parent pointer, a child pointer, and a node-link. The node-link is a pointer that connects all nodes with the same item to each other. Since the FP-growth algorithm was proposed, various algorithms, such as LP-tree [45], FIUT [56], AFOPT [36], BFP-growth [1], FPgrowth* [19], FPmax* [18], Binary Search Header Three (BSHT) [25] and FPclose [19], were developed adopting the FP-tree structure. A survey on FP-tree based mining methods has been conducted and published by Jamsheela and Raju [24]. The algorithm CoMine uses the FP-tree and the FP-growth method to discover the complete set of correlated itemsets in a database [31]. A modified version of CoMine called CoMine++ also uses the FP-tree and the FP-growth method for mining. The GCoMine algorithm uses the CP-tree during frequent itemset mining [48]. GCoMine used multiple minAllConf threshold values to avoid the rare item problem. The algorithms PrePost [10], FIN [4, 11, 12] and PrePost+ [13] have used both methods (FP-tree and Apriori) to improve the mining. Pyun et al. [45] recommended a new tree structure called the linear prefix
tree (LP-tree) to implement an outstanding frequent itemset mining technique. An LP-tree is constructed by using arrays to minimize pointers between nodes. Tsay et al. [56] suggested a novel method, the frequent items ultrametric tree (FIUT), to enhance the efficiency of obtaining frequent itemsets. Tseng et al. [57] introduced an adaptive mechanism to find a suitable data structure among two pattern list structures for mining frequent itemsets. The frequent pattern list (FPL) for sparse databases and the transaction pattern list (TPL) for dense databases are the two structures. Database density is the selection criterion, and they suggested a method to calculate it. Lin et al. [35] proposed an improved frequent pattern (IFP) growth method with a new tree structure to improve the performance of mining. IFP-growth needs additional memory to hold an address table attached to each node. The address table contains the item name and a pointer to its child. IFP-growth does not reduce the size of the tree, since it still uses the original FP-tree-based structures with an additional address table. Borgelt et al. [7] suggested a new data structure to find frequent itemsets, stating that their algorithm is the simplest. Racz et al. [47] used arrays to store the nodes and suggested an alternate method to find frequent itemsets without rebuilding conditional FP-trees. The recursive mining process is replaced by building new tree structures, which avoids rebuilding each conditional pattern base. Deng et al. [10] proposed N-lists and the PPC-tree (PrePost-tree) to find frequent itemsets. In this method, each node contains its pre-order and post-order sequence numbers. Deng et al. [12] proposed another method (FIN) by using the Nodeset, a more efficient data structure, for mining frequent itemsets. FIN applied the pruning method suggested by Rymon [50] to reduce the search space. In PrePost two properties have been used, but FIN requires only the pre-order (or post-order) code of each node. Deng et al. proposed another, more efficient method, PrePost+ [13]. The authors used the same pruning method as FIN and the node structure of PrePost to improve the performance. A modified FP-tree is used for ontology learning with applications in education [51]. The authors have used a regular expression parser approach, deterministic finite automata (DFA), for concept extraction. The same authors have introduced a more efficient method for frequent itemset mining [11]. Aryabarzan et al. proposed another efficient data structure called NegNodeset, sets of nodes in a prefix tree. Another modification of the FP-tree is used in the medical data environment [15]. The procedure is the same as for the FP-tree, but the authors have removed the infrequent items from the transactions by applying a new database scan. In another research paper [8], to enhance the frequent itemset generation, the authors applied a modified anti-monotone support constraint to the FP-growth algorithm. The modification is applied after constructing the FP-tree. The authors proved that the mining process resulted in a comparable difference in the number of generated itemsets and association rules. An improved LBP operator based on FP-growth is suggested by Long et al. [37]. The authors have applied the modified algorithm to a face database for face recognition. A full compression frequent pattern tree (FCFP-tree) is proposed by Sun et al. [55] to solve the problem of large and rapidly expanding datasets faced by mining algorithms. The authors have mentioned that a compromise in memory use is needed to achieve this goal. A two-dimensional table is added in another FP-tree-based algorithm to improve the efficiency of weighted frequent itemset mining, proposed by Li and Yin [34]. A modified conditional FP-tree (MCFP-tree) and a modified FP-growth (MFP-growth) algorithm are proposed by Ahmed and Nath [3] to avoid the creation of conditional FP-trees during frequent itemset generation from the FP-tree. Many algorithms have been proposed as improved FP-tree algorithms, such as those of Caroro et al. [8], Ahmed and Nath [3], Yang et al. [59] and Zhang et al. [63]. The experimental results show that fast algorithms consume more memory. A detailed study and analysis of the recent FP-tree-based proposals revealed that more efficient data structures can improve the runtime and memory usage of a mining algorithm. The above discussed algorithms did not consider the searching time during the tree construction. This leads to the development of a novel tree structure based on the FP-tree.

3 CSFP-tree: child structured frequent pattern tree

The child structured frequent pattern tree is a modified frequent pattern tree which attaches a child sub-tree to each node. The child sub-trees are a kind of child search tree formed with the children of each node. Details of the CSFP-tree construction procedure and related techniques are presented in this section. Table 2 shows the details of the variables used in the article.

Table 2 List of variables used

Variable name | Description
TID | Transaction id
DB | Transaction database
dcp | Direct child pointer
lsp | Left sibling pointer
rsp | Right sibling pointer
pp | Parent pointer
C(P) = \sum_{i=0}^{n} (k - i)    (1)

where C(P) is the complexity of constructing path P, k is the number of items in itemset I, j is the number of items in transaction t, and n = j - 1.

Each node of the tree has a set of children. In the root node, each child represents a separate branch of the tree by holding a unique item. When a new transaction is inserted into the tree, the children of the root have to be searched for the existence of the first item of the transaction. If the item is found among the children of the root, the frequency count of that item is increased by one; otherwise a new child is added to the root and forms a new branch with the remaining items. When a new node is added to the root, the number of child pointers in the root increases. In the worst case, the children of the root include all items.

The FP-tree and most of its recent improvements use a linear data structure to store the children of each node (PrePost [10], FIN [12], PrePost+ [13]). Only a dynamic data structure is appropriate to store the children, because the number of children cannot be predicted in advance and new items have to be added dynamically. Dynamic linear data structures such as the linked list favor a linear search over a binary search. Linear search is the simplest search algorithm: the first item is compared first, then the second item, and so on until the target item is found or the end of the list is reached. As the list grows in size, the number of comparisons required to find a target item grows linearly in both the worst and the average case. In the FP-tree, child pointers are added during the construction process, and hence the comparison becomes time consuming. Binary search is faster than linear search in the average case and the worst case [5, 28, 39, 43, 53, 54]. Binary search has also been applied in various algorithms and proved faster than linear search [21, 39, 40, 53]. The binary tree has also been proved to be an efficient tree for searching [32].

A binary search tree (BST) is a tree-like data structure where each node has no more than two child nodes. The left sub-tree contains only nodes with keys less than the parent node; the right sub-tree contains only nodes with keys greater than the parent node. The main advantage of a binary search tree is that it remains ordered, which provides a faster search than many other data structures [22].

By analyzing the insertion procedure of the FP-tree, we can find that the insertion of each item leads to a search among the children to find the exact location. The CSFP-tree is introduced to enable fast searching among the child list of the nodes. The concept of a binary search tree is used to construct the child search tree (CST) of each node.

The concept can be illustrated by using an example. Table 2 lists the variables and their descriptions. Table 3 is a simple transactional database with 5 transactions. The minimum support is fixed as one, and Table 4 contains the frequent items with their frequencies. Figure 1 is the FP-tree with the structure of a normal FP-tree, constructed with the 5 transactions in Table 3. The FP-tree in Fig. 2 is the same FP-tree, displayed to show its search path among the children. Figure 3 shows the CSFP-tree of the 5 transactions in Table 3. The CSFP-tree construction steps are illustrated in detail with an example in Sect. 3.4. In Figs. 2 and 3, the direct children of the root are highlighted with shaded circles and those without shading are normal nodes. The root node is the node with the symbol R.
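The ordered-search idea behind the CST can be sketched as a small binary search tree over item names. This is an illustrative sketch only; the class and method names are ours, not the paper's:

```java
// Minimal binary search tree over item names: keys smaller than a node's
// key go to its left sub-tree, larger keys go to its right sub-tree.
class ItemBst {
    String key;
    ItemBst left, right;

    ItemBst(String k) { key = k; }

    // Ordered insertion: one comparison per level decides the direction.
    static ItemBst insert(ItemBst root, String key) {
        if (root == null) return new ItemBst(key);
        int c = key.compareTo(root.key);
        if (c < 0) root.left = insert(root.left, key);
        else if (c > 0) root.right = insert(root.right, key);
        return root;
    }

    // Ordered lookup: follows a single root-to-leaf path, so the number of
    // comparisons is bounded by the height of the tree, not by the total
    // number of children as in a linear child list.
    static ItemBst find(ItemBst root, String key) {
        ItemBst cur = root;
        while (cur != null) {
            int c = key.compareTo(cur.key);
            if (c == 0) return cur;
            cur = (c < 0) ? cur.left : cur.right;
        }
        return null;
    }
}
```

With the children F, C, B and D inserted in that order, F becomes the root of the BST and a lookup for B follows the path F, C, B, touching only three nodes.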
3.3.1 Node structure
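Based on the pointer fields of Table 2 (dcp, lsp, rsp, pp) and the insertion procedure described in Sect. 3.4, a CSFP node and the transaction-insertion step can be sketched roughly as follows. This is our own reconstruction, not the authors' implementation, and all names are ours:

```java
import java.util.*;

// One CSFP-tree node: an item, its support count, and the four links of
// Table 2: direct child (dcp), left/right siblings (lsp, rsp) that form
// the child search tree, and the parent pointer (pp).
class CsfpNode {
    String item;
    int count = 1;
    CsfpNode dcp, lsp, rsp, pp;

    CsfpNode(String item, CsfpNode parent) { this.item = item; this.pp = parent; }
}

class CsfpTree {
    final CsfpNode root = new CsfpNode(null, null); // node R, holds no item

    // Insert one transaction that is already frequency-sorted and freed of
    // infrequent items: match -> increment -> move to the direct child;
    // mismatch -> move to the left/right sibling; NULL sibling -> attach
    // the remaining items as a new branch.
    void insert(List<String> t) {
        CsfpNode parent = root;
        int i = 0;
        while (i < t.size()) {
            if (parent.dcp == null) {            // no children yet
                parent.dcp = branch(t, i, parent);
                return;
            }
            CsfpNode cur = parent.dcp;           // root of this node's CST
            while (true) {
                int c = t.get(i).compareTo(cur.item); // order by item name only
                if (c == 0) { cur.count++; parent = cur; i++; break; }
                if (c < 0) {
                    if (cur.lsp == null) { cur.lsp = branch(t, i, parent); return; }
                    cur = cur.lsp;
                } else {
                    if (cur.rsp == null) { cur.rsp = branch(t, i, parent); return; }
                    cur = cur.rsp;
                }
            }
        }
    }

    // The remaining items form a chain of direct children.
    private CsfpNode branch(List<String> t, int i, CsfpNode parent) {
        CsfpNode head = new CsfpNode(t.get(i), parent), n = head;
        for (int j = i + 1; j < t.size(); j++) {
            n.dcp = new CsfpNode(t.get(j), n);
            n = n.dcp;
        }
        return head;
    }
}
```

Inserting the first sorted transactions of Table 5 ('F B D', 'C B J', 'B A H') reproduces the walkthrough of Sect. 3.4: C is attached as the left sibling of F, B as the left sibling of C, and all three carry the root as their parent.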
descending order of the support. All the nodes have a parent link pointing to their respective parents.

3.4 Tree construction

The tree is created with the frequent 1-itemsets. The database is scanned for the first time to find the frequent 1-itemsets, as in the case of the FP-tree. Thereafter, the header table is created with the frequent 1-itemsets sorted in descending order of the support. The next step is the construction of the CSFP-tree.

In order to construct the CSFP-tree, a second scan of the database is carried out. Take the first transaction; after removing the infrequent items, sort the transaction in descending order of the support. Let X be the first item of the transaction. X is the first child of the root and hence becomes the direct child of the root. The remaining items of the transaction are formed as a branch of direct children from X. To insert the second transaction, compare its first item with X. If the item is smaller than (or greater than) X, a new node is created and added as the left sibling (or right sibling) of node X. If the two items are the same, then increase the support count of node X and compare the next item of the transaction with the direct child of X, and so on. The insertion of the remaining transactions is as follows.

Remove the infrequent items and sort the items of each transaction. Compare the first item I1 of each transaction with X, the direct child of the root. If I1 matches X, increment the support count of X and compare the next item I2 of the transaction with the direct child of X. This process is continued (match, increment, move to direct child) till a mismatch between an item In and a node Y is found. Then, if In < Y, continue the process with the left sibling of Y; otherwise, continue with the right sibling of Y.

If the match with a node Y is successful and Y has no direct child, then the remaining items, if any, are formed as a branch of direct children from Y.

If the search for a match of an item Ij ends in a NULL (i.e. when Ij < Y and the left sibling of Y is NULL, or Ij > Y and the right sibling of Y is NULL), then insert Ij as a sibling Z (left sibling if Ij < Y, or right sibling if Ij > Y) of Y. The remaining items, if any, are formed as a branch of direct children from Z.

After inserting all the transactions, the children of each node form a sub-tree (CST), and the direct child of each node becomes the root of the CST. The search for an item among the children is very efficient with the CST. The sub-tree is called a child search tree because it is created by using the children of each node.

The CST is a tree structure based on the order property of the binary search tree, where the children of each node are added according to the rule of the binary search tree. The details are explained with an example in the following section.

The CSFP-tree construction is illustrated with the data given in Table 5. The minimum support is set as 3. Table 6 contains the list of frequent items. Column 3 of Table 5 contains the sorted transactions of column 2 after removing the infrequent items.

Table 5 The transaction database

TID | Transactions | Sorted transactions after removing infrequent items
1 | B, D, F | F, B, D
2 | B, C, J, P, Q | C, B, J
3 | A, B, G, H | B, A, H
4 | B, C, D, E, G | C, B, D, E
5 | C, D, E, F, J | C, F, D, E, J
6 | B, C, F, J, R, T | C, F, B, J
7 | A, D, E, S | D, E, A
8 | C, F, L, U | C, F, L
9 | D, F, H, I, S | F, D, H
10 | C, F, R | C, F
11 | K, L, M | K, L, M
12 | A, K, M, Q | A, K, M
13 | E, K, T | E, K
14 | H, L, U | H, L
15 | M, N, O | M

Table 6 List of frequent items

Item | Frequency
C | 6
F | 6
B | 5
D | 5
E | 4
A | 3
H | 3
J | 3
K | 3
L | 3
M | 3

'F B D' is the first transaction to be inserted. 'F' is inserted as the direct child of the root, and the remaining items 'B' and 'D' are formed as a branch of direct children from F. 'F' becomes the root of the CST constructed by the children of the root. B is the direct child of F and the root of the CST of the children of node F. Figure 4b shows the CSFP-tree after inserting the first transaction. To insert the second transaction 'C B J,' search for C in the CST of the root. 'F' is the direct child of the root and 'F' has no siblings. Therefore, create a new node with C and compare C with
F. Insert C as the left sibling of F because C is less than F. Set the parent of C to the root node. Add a new branch from node C with 'B and J.' The next transaction to be inserted is 'B A H.' Compare B with the direct child 'F' of the root. B is less than F, so move to the left sibling of F. C is the left sibling of F. Compare B with C. Add B as the left sibling of C because B is less than C. Now the root has three children, i.e. F, C and B. F is the direct child of the root and is set as the root of the CST of the root. C and B are siblings of F and are added to the left subtree of F. Figure 4c shows the CSFP-tree after inserting the 3rd transaction. Insert all other transactions according to the above mentioned criteria.

Figure 5 is the CSFP-tree after inserting all the transactions in column 3 of Table 5. All the bold ovals with items F, C, B, A, D, E, K, H and M are the children of the root node. These children form a CST structure. The parent pointer of each node is linked to its respective parent. This criterion is used because each path has to be taken separately during the mining process. The other oval shaped nodes are also part of a CST, but they are the children of some internal nodes. The internal node F with support 4 has 3 children: D, B and L. The CST of F is formed with D, B and L; D is the root of the CST. The thick dashed lines denote left and right siblings. The other nodes
are normal nodes representing transactions. For the sake of clarity, the links which connect similar items are not included in the diagram.

4 Frequent itemset mining with CSFP-tree

Frequent itemset mining with the CSFP-tree involves two main algorithms and two sub-algorithms. Details of the algorithms are presented in this section.

4.1 CSFP-tree construction algorithm

Algorithm 1 in Fig. 6 is used to construct the CSFP-tree. Step 1 is used to find all the frequent 1-itemsets. The next step entails the creation of the header list. The 3rd step sets the root as null. Lines 4 to 25 are used to create the CSFP-tree. Two procedures are used here. The procedure 'childsearch' is used to search for a specific item in the partially constructed CSFP-tree.

The procedure childsearch (Fig. 7) accepts as its parameters the current child of the root, i.e. temp, and the item, i.e. I, to be searched for in the child BST. The transaction is already arranged in the order of the frequencies of each item. Therefore, the BST is formed according to the item name only and not according to the frequency. If I is less than the current child temp, the search moves to the left branch of the current node, and vice versa.

If the search is successful, the procedure returns the node with the searched item. Otherwise, it creates a new node nd with the item, adds nd in the proper location and returns nd. A flag is set to indicate whether the returned node is an existing node or a newly created node. To create new branches, the procedure 'createBranch' is invoked (Fig. 6) by sending the processed transaction as an argument. The output of the algorithm is the CSFP-tree.
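The childsearch procedure described above can be sketched as follows. This is a rough illustration under our own naming, with the flag realized as a boolean field:

```java
// Sketch of the 'childsearch' procedure: search item I in the child BST
// rooted at temp; on failure, create the node at the NULL sibling slot.
class ChildSearch {
    static class Node {
        String item;
        Node lsp, rsp;               // left/right sibling links of the CST
        Node(String item) { this.item = item; }
    }

    static boolean created;          // flag: is the returned node new?

    static Node childSearch(Node temp, String I) {
        created = false;
        while (true) {
            int c = I.compareTo(temp.item);
            if (c == 0) return temp;             // existing node found
            if (c < 0) {
                if (temp.lsp == null) { created = true; return temp.lsp = new Node(I); }
                temp = temp.lsp;                 // move to the left branch
            } else {
                if (temp.rsp == null) { created = true; return temp.rsp = new Node(I); }
                temp = temp.rsp;                 // move to the right branch
            }
        }
    }
}
```

The caller inspects the flag: for an existing node it only increments the support count, while for a newly created node it builds the remaining items of the transaction as a new branch via createBranch.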
4.2 CSFP‑growth algorithm
The real challenge, while introducing a new algorithm as an improvement of an existing one, is to reduce the computational complexity, i.e. the space and time complexities. The important question is how the time and space complexity can be addressed. The time complexity of the FP-tree is discussed in detail in Kadappa and Nagesh [27], Yin et al. [61], Wen-Yuan and Liu [58], Jia and Liu [26], and Agapito et al. [2]. The overall time complexity of the FP-tree is analyzed in the above cited papers. The time complexity of FP-tree creation is specified in Kosters et al. [30]. The CSFP-tree algorithm has three main steps, as mentioned below, where n is the number of transactions in DB and m is the number of items in each transaction:

1. During the first scan, it extracts all the frequent items from the database and sorts the frequent items in descending order of the frequency.

2. The second scan inserts each transaction into the tree. The cost of searching an item in a child search tree is O(1) in the best case, O(log n) in the average case and O(log n) in the worst case [54]. By using the child list as a binary tree, the time complexity for inserting a transaction into the tree is reduced to O(m * log n). To construct the CSFP-tree, the time complexity is

O(n * m * log n)    (4)

3. The last step is the mining of frequent itemsets from the CSFP-tree. The time complexity of this phase is the same as for the FP-tree.
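To give a feel for the gain, the expected number of key comparisons among k children is about (k + 1)/2 for a linear child list but only about log2(k) for a balanced child search tree. A back-of-envelope helper (our own illustration, assuming a reasonably balanced CST):

```java
public class SearchCost {
    // Expected comparisons for a successful search among k children
    // in a linear child list: roughly half the list is scanned.
    static double linearList(int k) { return (k + 1) / 2.0; }

    // Expected comparisons in a balanced BST: one per level, log2(k) levels.
    static double balancedBst(int k) { return Math.log(k) / Math.log(2); }
}
```

For a node with 1024 children this is roughly 512 comparisons against about 10, which is where the O(m * log n) insertion bound above comes from; as noted later, a skewed CST loses this advantage.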
The space complexity of the CSFP-tree is the same as that of the FP-tree, because the node structure and the number of nodes in the tree are the same as in the FP-tree. The advantage of the CSFP-tree is that, without increasing memory, it can improve the running time of the algorithm.

6 Performance evaluation

In this section, the experimental results are presented. The proposed algorithm is compared with seven other algorithms: FP-growth [20], H-Mine [44], FIN [12], PrePost [10], PrePost+ [13], DFIN [11] and NegFIN [4]. The proposed method is implemented in Java, and the platform is an Intel CPU at 3.3 GHz with 4 GB RAM and Windows 7 32-bit OS. Implementations of all other algorithms are taken from the SPMF website (http://www.philippe-fournier-viger.com/spmf/index.php?link=license.php) [14]. The FP-growth algorithm is chosen as the baseline algorithm. PrePost+ has been proven to be the best algorithm among all node-based methods. H-Mine is included because of its efficiency in memory usage. FIN, PrePost+, DFIN and NegFIN are recent algorithms for finding frequent itemsets. The run times of PrePost+ and FIN are lower than the run times of H-Mine and FP-growth, but the memory usage of the PrePost+ and FIN algorithms is very high compared with H-Mine and FP-growth. The most efficient algorithm among these seven is the NegFIN algorithm, but it uses more memory than FP-growth and the CSFP-tree.

Seven datasets are used in the experiments. The details of the datasets are shown in Table 7. These datasets are publicly available in the FIMI repository (http://fimi.ua.ac.be) [16], and all are real datasets except T10I4D100K. Mushroom, Connect and Pumsb are dense datasets; Chess and Retail are sparse datasets. To evaluate the runtime and memory usage of the proposed algorithm, the datasets have been used in their original form without losing any data. During the run time tests, a huge number of frequent itemsets can be generated from each dataset. These outputs are too large to compare manually. Hence, to evaluate the accuracy of the new algorithm, a subset of data from each dataset was used with a convenient minimum support, and the output for every dataset was compared with the output of the other algorithms. It is observed that the frequent itemsets generated by the proposed algorithm are the same as the frequent itemsets generated by the other algorithms.

Table 7 Datasets

Datasets | Transactions | Items
Accidents | 340183 | 468
Retail | 88162 | 16470
Connect | 67557 | 129
Pumsb | 49046 | 7116
Chess | 3196 | 84
Mushroom | 8124 | 119
T10I4D100K | 98487 | 949

Figures 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21 and 22 show the experimental results. With all the datasets, the runtime of the proposed algorithm is better than FP-growth and H-Mine. In the memory usage evaluation, the proposed algorithm uses less memory than the other algorithms except FP-growth. Figures 9 and 10 show the run time and memory consumption of all algorithms on the dataset 'Connect.' CSFP-growth performed well at the lowest minimum supports, but NegFIN and DFIN are the fastest among the other algorithms. The algorithm H-Mine could not be included in the result because it takes too much time with the dataset 'Connect.'

FP-growth consumes less memory than all the others. The memory consumption of CSFP-growth is close to that of FP-growth. PrePost drastically increases its memory usage when
minimum support crosses 80%. PrePost+ consumes more memory than FP-growth and CSFP-growth. The graph representing FIN is a consistent line for all minimum support values, and FIN consumes more memory than the others at the highest minimum support.

The results on the dataset 'Mushroom' are shown in Figs. 11 and 12. Here the run times of all algorithms except H-Mine are the same up to a 10% minimum support. At the lowest minimum supports, the proposed CSFP-growth and NegFIN are the fastest algorithms. CSFP-growth and
FP-growth consume less memory than the other algorithms for all minimum supports. At a 2% minimum support, CSFP-growth consumes less memory than FP-growth. The memory usage of H-Mine is lower than that of PrePost, FIN and PrePost+.

For the dataset 'Pumsb,' PrePost+, PrePost, FIN, DFIN and NegFIN have performed almost the same, though
PrePost+ and PrePost run a little faster than FIN. The performance of CSFP-growth with the dataset Pumsb is not as good as with the other datasets, but it still performs better than FP-growth. The recent algorithms (FIN, PrePost, DFIN, NegFIN) perform well with the dataset 'Pumsb,' but consume more memory than the proposed CSFP-growth algorithm. PrePost is one of the fastest algorithms with 'Pumsb,' but consumes more memory than the other algorithms when the minimum support value is reduced. The results are given in Figs. 13 and 14.
Figures 15 and 16 show the runtime and memory usage of the algorithms on the dataset 'Chess.' Although all the algorithms except H-Mine performed well, CSFP-growth and FP-growth consume less memory. H-Mine is the slowest algorithm at each and every minimum support value.

In the memory usage evaluation on the dataset 'Chess,' PrePost consumes more memory than the other algorithms. FP-growth consumes less memory than the other algorithms. The memory consumption of CSFP-growth is lower than that of the other
algorithms except for FP-growth. The algorithm DFIN consumes more memory than the algorithm NegFIN.

Figures 17 and 18 show the experimental results on the dataset 'Retail.' The variations in the results are not significant, but NegFIN and DFIN are a little faster than the other algorithms. PrePost and PrePost+ consume more memory than the other algorithms. The memory usage of the CSFP-growth algorithm is close to that of FP-growth, which consumes less memory than the others.

For the dataset 'Accidents,' CSFP-growth, PrePost+, PrePost, FIN, DFIN and NegFIN have performed almost the same. The recent algorithms (FIN, PrePost, DFIN, NegFIN, etc.) perform well with the dataset 'Accidents,' but consume more memory than the proposed CSFP-growth algorithm. The results are given in Figs. 19 and 20.

Figures 21 and 22 show the experimental results on the dataset 'T10I4D100K.' The variations in the results are not significant, but NegFIN, DFIN and FIN are a little slower than the other algorithms. NegFIN and DFIN consume more memory than the other algorithms. The memory usage of the CSFP-growth algorithm is close to that of FP-growth, which consumes less memory than the others.

Three datasets, Connect, Pumsb and Mushroom, are dense datasets. The runtime results of Mushroom and Connect show that the proposed algorithm performed better than the other algorithms even with the huge number of transactions. The remaining two datasets, Retail and Chess, are sparse datasets. The time complexity of the algorithms with the dataset Chess is almost the same, except for H-Mine and FP-growth, but when analyzing the memory usage of the algorithms with the same dataset, all algorithms consume more memory except the proposed algorithm and FP-growth. The Retail dataset contains the highest number of transactions and the highest number of items among the five datasets. The runtime of CSFP-growth is the same as that of the other recent algorithms, PrePost and PrePost+, but the memory usage of CSFP-growth is less than a quarter of the memory usage of the recent algorithms.

The recent algorithms NegFIN and DFIN outperform the others with the datasets Connect and Retail, but consume more memory. CSFP-growth performs well with the datasets T10I4D100K, Pumsb, Mushroom and Accidents, and also uses less memory than the others.

The number of items in the dense dataset Pumsb is 7116 and the number of items in the sparse dataset Retail is 16470. The number of transactions in Retail is double the number of transactions in Pumsb. The CSFP-growth algorithm performs better with Retail but not with Pumsb. One reason is that Pumsb is a dense dataset. Another reason may be the tree structure: the CST is not a balanced tree and may be formed as a skewed tree. In such situations, the searching time is the same as for a linear search.
actions but with a fewer number of items. The performance In FP-tree and its new variants, tree construction is a
of the other algorithms outperforms when both the number continuous procedure. The main FP-tree is constructed first,
of items and number of the transactions are high. It can be then during the mining process, the conditional FP-trees are
analyzed from the runtime result of Pumsb. Among the three recursively constructed. The CST creation using the child
datasets, Connect contains the highest number of transac- nodes is not an overhead during the tree construction. Dur-
tions. If we analyze the runtime results of the three datasets ing the insertion of each item, child list of each level has to
we can conclude that the proposed algorithm is better when be searched to check whether the item is present or not. Nor-
the number of items is less and the performance will not be mally a linear search is carried out. The CST is constructed
affected when the number of transactions is increased. to implement the binary search and hence the search time
is reduced.
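To make the idea concrete, the following is a minimal Python sketch, not the authors' implementation, of inserting a transaction while locating each child through a binary search over a per-node child search tree; all class and field names (`Node`, `cst_root`, `left`, `right`) are hypothetical.

```python
class Node:
    """One prefix-tree node: item, count, and a child search tree (CST)
    over its children keyed by item. Field names are assumptions."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 1
        self.parent = parent
        self.cst_root = None   # single pointer to this node's child subtree
        self.left = None       # CST link: sibling with a smaller item
        self.right = None      # CST link: sibling with a larger item

def get_or_add_child(node, item):
    """Binary-search the CST of `node` for `item`; insert a new child if
    absent. O(log k) on a balanced CST vs O(k) for a linear child list."""
    if node.cst_root is None:
        node.cst_root = Node(item, parent=node)
        return node.cst_root
    cur = node.cst_root
    while True:
        if item == cur.item:
            cur.count += 1
            return cur
        nxt = cur.left if item < cur.item else cur.right
        if nxt is None:
            child = Node(item, parent=node)
            if item < cur.item:
                cur.left = child
            else:
                cur.right = child
            return child
        cur = nxt

def insert_transaction(root, items):
    """Insert one filtered, frequency-ordered transaction."""
    cur = root
    for it in items:
        cur = get_or_add_child(cur, it)
```

Note how the parent keeps only one pointer (`cst_root`); the remaining children hang off their siblings' `left`/`right` links, matching the pointer-rearrangement argument made for the CSFP-tree.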
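The skew problem noted above, where ordered insertions degenerate the CST into a chain and search becomes linear, is what a median-first insertion order would avoid. The sketch below is an illustration of that balancing idea under the assumption that the keys are already sorted; it is not the authors' procedure, and the function name is hypothetical.

```python
def median_first_order(sorted_keys):
    """Yield pre-sorted keys in median-first order. Feeding this order to a
    plain BST insert yields a balanced tree instead of a degenerate chain."""
    if not sorted_keys:
        return
    mid = len(sorted_keys) // 2
    yield sorted_keys[mid]                                # subtree root first
    yield from median_first_order(sorted_keys[:mid])      # then the left half
    yield from median_first_order(sorted_keys[mid + 1:])  # then the right half
```

For example, inserting 1..7 in the emitted order (4, 2, 1, 3, 6, 5, 7) produces a perfectly balanced BST of height 3, whereas inserting 1..7 in sorted order produces a right-leaning chain of height 7.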
The memory consumption of the proposed CSFP-tree is the same as that of the FP-tree. In the FP-tree and the other FP-tree variants, each node must be pointed to by its parent. Let x be a node with four children z1, z2, z3 and z4. Then x contains four pointers, one to each child. In the proposed algorithm, however, each parent points to only one child node. Hence, in the CSFP-tree x has only one child pointer, which points to the first child z1; z1 contains a maximum of two sibling pointers to hold z2 and z3, and z4 is pointed to from z2 or z3. In the FP-tree structure the four pointers are all stored in x, whereas in the CSFP-tree the pointers are distributed over different nodes. There is no increase in the number of pointers; they are only rearranged. The proposed algorithm creates no additional nodes; the existing nodes are re-positioned to form the CSFP-tree structure. Therefore, no extra memory is used to construct the proposed CSFP-tree.

7 Conclusion

Association rule mining is the task of finding interesting rules in databases, which is helpful in many application areas in different ways. Many improvements have been introduced in this area so far. However, reductions in execution time are typically accompanied by a significant increase in memory usage: the improved algorithms have proved their efficiency only in runtime, by compromising with high memory usage. For huge datasets, optimizing memory usage is quite important, and in this sense many of the proposed algorithms are not efficient. A new data structure named CSFP-tree, which is more efficient than the FP-tree, is introduced. The CSFP-tree is used in the new algorithm named CSFP-growth for mining complete frequent patterns. The modified prefix-tree structure, the CSFP-tree, is proposed to speed up frequent itemset mining without using extra memory. The order property of the child search trees is based on the concept of binary search; hence the structure of the CSFP-tree enables faster find/insert operations. This is achieved without adding any extra node in comparison to the FP-tree. Experiments carried out on different standard datasets established the efficacy of the proposed algorithm in comparison to the FP-tree and its new variants.

The problem with a BST is that, depending on the order in which elements are inserted, the shape of the tree may vary. In the worst case, the tree looks like a linked list in which each node has only a right child. Therefore, on a few datasets the proposed CSFP-tree does not perform well. As future work, extra sorting can be applied to the transaction database: after filtering and sorting the items in each transaction, the entire transaction set can be sorted according to the first item of each transaction. The middle transaction can then be inserted into the tree as the first insertion to get a balanced CSFP-tree, so that the algorithm would perform better with any kind of dataset. Transaction sorting is, however, a time-consuming process; a cheaper transaction sorting technique can be introduced, and the CSFP-tree algorithm will have to be extended with a balanced BST to get better results.

Funding No funding is available.

Availability of data and material Publicly available data is used.

Code availability Not applicable.

Declarations

Conflicts of interest No conflict of interest.