Comparing Dataset Characteristics That Favor The Apriori, Eclat or FP-Growth Frequent Itemset Mining Algorithms

Abstract—Frequent itemset mining is a popular data mining technique. Apriori, Eclat, and FP-Growth are among the most common algorithms for frequent itemset mining. Considerable research has been performed to compare the relative performance of these three algorithms by evaluating the scalability of each algorithm as the dataset size increases. While scalability as data size increases is important, previous papers have not examined the performance impact of similarly sized datasets that contain different itemset characteristics.
This paper explores the effects that two dataset characteristics can have on the performance of these three frequent itemset algorithms. To perform this empirical analysis, a dataset generator is created to measure the effects of frequent item density and the maximum transaction size on performance. The generated datasets contain the same number of rows. This provides some insight into dataset characteristics that are conducive to each algorithm. The results of this paper's research demonstrate that Eclat and FP-Growth both handle increases in maximum transaction size and frequent itemset density considerably better than the Apriori algorithm.

I. INTRODUCTION

The research covered by this paper determines how the characteristics of a dataset might affect the performance of the Apriori, Eclat, and FP-Growth frequent itemset mining algorithms. These algorithms have several popular implementations[1], [2], [3]. The goal of this research is to determine the effects of basket size and frequent itemset density on the Apriori, Eclat, and FP-Growth algorithms. The research determined that these two dataset characteristics have a significant impact on the performance of the algorithms.

Most research into frequent itemset mining focuses upon the performance differences between frequent itemset algorithms on a single dataset[4]. The effects of hyper-parameters, such as minimum support, upon the performance of frequent itemset mining algorithms have also been explored[5]. Some papers make use of common datasets from the UCI Machine Learning Repository[6]. Many papers make use of the IBM Quest Synthetic Data Generator[7] or some variant of it. Our paper makes use of a Python-based generator that is based on IBM's work[8].

This research evaluates the performance of the Apriori, Eclat and FP-Growth frequent itemset mining algorithms implemented by Christian Borgelt in 2012[9]. Though association rule mining is a similar technique, this research is limited to frequent itemset mining. By limiting the experimentation to a single implementation of frequent itemset mining, this research is able to evaluate how the characteristics of the dataset affect the performance of these algorithms.

II. FREQUENT ITEMSET MINING

Frequent itemset mining was introduced as a means to find frequent groupings of items in a database containing baskets/transactions of these items[10]. The database is composed of a series of baskets that are analogous to orders placed by customers. These orders are individual baskets that are made up of some number of items. Companies, such as Amazon, Netflix and other online retailers, make use of frequent itemsets to suggest additional items that a consumer might want to purchase, based on their past purchasing history and the history of others with similar baskets[11]. The following data show baskets that might be used for frequent itemset mining, where each line represents a single basket of items.

[mp3-player usb-charger book-dct book-ths]
[mp3-player usb-charger]
[usb-charger mp3-player book-dct book-ths]
[usb-charger]
[book-dct book-ths]

From the above baskets several frequent itemsets can be defined. These are sets of items that frequently occur together, some of which are:

[mp3-player usb-charger]
[book-dct book-ths]
...

A simple visual analysis of the data shows that the items mp3-player and usb-charger frequently occur together. Likewise, book-dct and book-ths also frequently occur together. Frequent itemset algorithms make use of a variety of statistics to determine which itemsets to include.
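The co-occurrence patterns identified by this visual analysis can be reproduced in a few lines of code. The following Python sketch is illustrative only; it is not the paper's generator or any of the benchmarked implementations. It counts how often each pair of items appears together in the example baskets:

```python
from itertools import combinations
from collections import Counter

# The example baskets from the text, one list per basket.
baskets = [
    ["mp3-player", "usb-charger", "book-dct", "book-ths"],
    ["mp3-player", "usb-charger"],
    ["usb-charger", "mp3-player", "book-dct", "book-ths"],
    ["usb-charger"],
    ["book-dct", "book-ths"],
]

# Count every 2-item combination per basket; sorting each basket
# makes a pair's key independent of item order within the basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

print(pair_counts[("mp3-player", "usb-charger")])  # 3
print(pair_counts[("book-dct", "book-ths")])       # 3
```

Both example pairs co-occur in three of the five baskets, which matches the visual analysis above.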
III. SURVEY OF APRIORI, ECLAT AND FP-GROWTH

There are a variety of different algorithms that are used to mine frequent itemsets. First, a simple naive brute-force algorithm for building frequent itemsets will be evaluated. This paper then shows how Apriori, Eclat and FP-Growth address some of the shortcomings of the naive algorithm.

All four algorithms must calculate statistics about itemsets that might ultimately be included in the final collection of frequent itemsets. One statistic that is common to all four of these algorithms is support. The support of a candidate frequent itemset is the total count of how many of the database baskets support that candidate. A basket is said to cover a candidate itemset if the candidate is a subset of, or equal to, the basket. Support is sometimes expressed as a percentage of the total number of baskets in the database (N) that cover a candidate itemset (X). The following formula calculates the support percentage of a candidate itemset:

    supp(X) = X_count / N    (1)

This equation can be applied to calculate the support for {mp3-player usb-charger} from the previously presented set of baskets:

    supp({mp3-player usb-charger}) = 3 / 5 = 0.6    (2)

The support statistic of 0.6 indicates that 60% of the five baskets contain the candidate itemset {mp3-player usb-charger}. Most frequent itemset algorithms accept a minimum support parameter to filter out less common itemsets.

IV. NAIVE ALGORITHM FOR FINDING FREQUENT ITEMSETS

It is not difficult to extract frequent itemsets from basket data. It is, however, difficult to do so efficiently. For the algorithms presented here, let J represent a set of items; likewise, let D represent a database of baskets that are made up of those items. Algorithm 1 is a summarization of the naive frequent itemset algorithm provided by Garcia-Molina, Ullman, and Widom[12].

Algorithm 1 Naive Frequent Itemset Algorithm
1: INPUT: A file D consisting of baskets of items.
2: OUTPUT: The sets of itemsets F1, F2, ..., Fq, where Fi is the set of all itemsets of size i that appear in at least s baskets of D.
3: METHOD:
4: R ← integer array of size 2^|J|, one counter per possible item combination in D
5: for n ← 1 TO |D| do
6:   F ← all possible set combinations from Dn
7:   Increase each value in R[] corresponding to each set in F
8: return all itemsets with R[] ≥ s

The naive algorithm simply generates all possible itemsets, counts their support, and then discards all itemsets below some threshold level of support. The constant S or σ typically represents the support threshold. Scanning the database to count support is only an O(N) operation, where N is the number of baskets in the database. However, the naive algorithm would also need 2^i memory cells to store all of these itemsets as the counts are generated, where i is the number of individual items. These memory cells would typically be 32 or 64-bit integers. This memory requirement means that the naive algorithm is impractical for anything but a trivial number of individual items. A computer with 128GB of available RAM would theoretically only be able to handle 34 items, when using a 64-bit integer to hold the counts (2^34 counters at 8 bytes each is 128GiB). When it is considered that i might be the total count of distinct items for sale by a retailer such as Walmart or Amazon, it is obvious that the naive approach is not useful in practice.

V. NAIVE ALGORITHM EXAMPLE

This section demonstrates how the naive algorithm would handle the example basket set given earlier in this paper. The total number of items contained in the database, |J|, is equal to four. Four items can be combined a total of 2^|J|, or 16, different ways. However, because one of these combinations is the empty set, only the following 15 candidate itemsets are considered:

[book-ths]
[book-dct]
[book-dct, book-ths]
[usb-charger]
[usb-charger, book-ths]
[usb-charger, book-dct]
[usb-charger, book-dct, book-ths]
[mp3-player]
[mp3-player, book-ths]
[mp3-player, book-dct]
[mp3-player, book-dct, book-ths]
[mp3-player, usb-charger]
[mp3-player, usb-charger, book-ths]
[mp3-player, usb-charger, book-dct]
[mp3-player, usb-charger, book-dct, book-ths]

The above itemsets are considered candidate frequent itemsets because it has not yet been determined whether all of these candidates will be included in the final list of frequent itemsets. Once the candidate itemsets have been determined, the naive algorithm will pass over all baskets and count the support for each of the candidate itemsets. Candidate itemsets that are below the required support S will be purged. The naive algorithm would calculate support for each candidate as follows:

[book-ths]; s = 3
[book-dct]; s = 3
[book-dct, book-ths]; s = 3
[usb-charger]; s = 4
[usb-charger, book-ths]; s = 2
[usb-charger, book-dct]; s = 2
[usb-charger, book-dct, book-ths]; s = 2
[mp3-player]; s = 3
[mp3-player, book-ths]; s = 2
[mp3-player, book-dct]; s = 2
[mp3-player, book-dct, book-ths]; s = 2
[mp3-player, usb-charger]; s = 3
[mp3-player, usb-charger, book-ths]; s = 2
[mp3-player, usb-charger, book-dct]; s = 2
[mp3-player, usb-charger, book-dct, book-ths]; s = 2

It is necessary to store a count for every possible itemset when using the naive algorithm. Of course, once the support counts are determined, many of the candidate itemsets will be purged. Nevertheless, the fact that these values must be kept while the database is scanned for support means the naive algorithm requires considerable memory.

VI. APRIORI ALGORITHM FOR FINDING FREQUENT ITEMSETS

Agrawal and Srikant initially introduced the Apriori algorithm to provide performance improvements over a naive itemset search[13]. The Apriori algorithm has been around almost as long as the concept of frequent itemsets and is very popular. The naive algorithm is a theoretical concept and is not used in practice; Apriori has become the classic implementation of frequent itemset mining. Apriori, as defined by Goethals (2003), is presented as Algorithm 2[14].

Algorithm 2 Apriori Frequent Itemset Algorithm
1: INPUT: A file D consisting of baskets of items, a support threshold σ.
2: OUTPUT: A list of itemsets F(D, σ).
3: METHOD:
4: C1 ← {{i} | i ∈ J}
5: k ← 1
6: while Ck ≠ {} do
   ...

Apriori is based on the hierarchical monotonicity of frequent itemsets between their supersets and subsets. As implied by monotonicity, a subset of a frequent itemset must also be frequent. Likewise, a superset of an infrequent itemset must also be infrequent[13]. This allows the Apriori algorithm to be implemented as a breadth-first search. Papers by Goethals (2003) and others do not represent Apriori's performance in terms of big-O notation[14]. This is likely due to the fact that Apriori's outer loops are bounded by the number of common prefixes and not by some easily determined constant, such as the number of items or the length of the dataset. Papers describing Apriori, Eclat, and FP-Growth rely on empirical comparison of algorithms rather than big-O analysis. However, analysis covered later in this paper does allow these three algorithms to be expressed in big-O terms based on average basket size and frequent itemset density.

Apriori first builds a list of all singleton itemsets with sufficient support. Building on the monotonicity principle, the next set of candidate frequent itemsets is built from combinations of the singleton itemsets. This process continues until the maximum length specified for frequent itemsets is reached. The evaluations performed by this research did not impose this maximum length.

The primary issue with the Apriori algorithm is that it is necessary to perform a scan of the database at every level of the breadth-first search. Additionally, candidate generation can produce a great number of subsets and can impose a significant memory requirement. Deficiencies in the Apriori algorithm led to the development of other, more efficient, algorithms, such as Eclat and FP-Growth.

VII. APRIORI ALGORITHM EXAMPLE

This section demonstrates how the Apriori algorithm handles the basket set given earlier in this paper. The Apriori algorithm performs a breadth-first search of the itemsets. Figure 1 shows a segment of this search, for the items usb-charger, mp3-player, and book-dct.

The candidate set starts empty, and begins by adding all singleton itemsets that have sufficient individual support. For simplicity, it is assumed that only usb-charger, mp3-player, and book-dct have sufficient support. The next layer is built from combinations of the previous layer that had sufficient support. For simplicity, it is also assumed that all three combinations
had sufficient support. Finally, a triplet itemset with all three items is evaluated.

VIII. ECLAT ALGORITHM FOR FINDING FREQUENT ITEMSETS

Eclat was introduced by Zaki, Parthasarathy, Ogihara, and Li in 1997[15]. Eclat is an acronym for Equivalence Class Clustering and bottom-up Lattice Traversal. The primary difference between Eclat and Apriori is that Eclat abandons Apriori's breadth-first search for a recursive depth-first search. Eclat, as defined by Goethals (2003), is presented as Algorithm 3[14].

Algorithm 3 Eclat Frequent Itemset Algorithm
1: INPUT: A file D consisting of baskets of items, a support threshold σ, and an item prefix I, such that I ⊆ J.
2: OUTPUT: A list of itemsets F[I](D, σ) for the specified prefix.
3: METHOD:
4: F[I] ← {}
5: for all i ∈ J occurring in D do
6:   F[I] := F[I] ∪ {I ∪ {i}}
7:   # Create Di
8:   Di ← {}
9:   for all j ∈ J occurring in D such that j > i do
10:    C ← cover({i}) ∩ cover({j})
11:    if |C| ≥ σ then
12:      Di ← Di ∪ {j, C}

... are encountered, they are added to the trie by inserting a node for each item that makes up the itemset. The left-most item corresponds to a child of the root node. The second item corresponds to a child of the first item of this frequent set. No parent would ever have more than one child of the same item name; however, an item name may appear at multiple locations in the trie.

The trie is generated so that the algorithm can quickly find the support of an itemset by traversing the trie as the items in the set are read left-to-right. The node that contains the right-most item contains the support for that itemset. As the algorithm processes the database, the trie is traversed looking for each itemset discovered. Nodes are created, if necessary, to fill out the trie to hold all itemsets. If the nodes already exist, the node for the right-most item in the itemset has its support increased. New nodes start with a support of 1. This allows Eclat to use less memory than Apriori, because the core branches of the trie allow heavily used subsets to be stored only once. Theoretically, a trie could be used with Apriori; however, the breadth-first nature of Apriori would typically require too much memory.

IX. ECLAT ALGORITHM EXAMPLE

This section demonstrates how the Eclat algorithm would handle the basket set given earlier in this paper. A portion of the trie built by Eclat is shown in Figure 2.
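The cover intersections at the heart of Algorithm 3 can be sketched in a few lines of Python. This is a simplified illustration built around the paper's example baskets, not Borgelt's implementation; the names `eclat`, `covers`, and `results` are ours, and the trie-based storage described above is omitted in favor of a plain dictionary:

```python
# Simplified Eclat-style depth-first mining over vertical tid-lists.
baskets = [
    {"mp3-player", "usb-charger", "book-dct", "book-ths"},
    {"mp3-player", "usb-charger"},
    {"usb-charger", "mp3-player", "book-dct", "book-ths"},
    {"usb-charger"},
    {"book-dct", "book-ths"},
]

def eclat(prefix, items, min_support, results):
    """items: list of (item, cover) pairs, where cover is the set of
    basket ids containing prefix ∪ {item}."""
    for i, (item, cover) in enumerate(items):
        if len(cover) < min_support:
            continue
        itemset = prefix | {item}
        results[frozenset(itemset)] = len(cover)
        # Build the conditional database D_i: intersect this item's
        # cover with every later item's cover, as in Algorithm 3.
        suffix = []
        for other, other_cover in items[i + 1:]:
            c = cover & other_cover
            if len(c) >= min_support:
                suffix.append((other, c))
        eclat(itemset, suffix, min_support, results)

# Vertical layout: each item maps to the ids of baskets covering it.
covers = {}
for tid, basket in enumerate(baskets):
    for item in basket:
        covers.setdefault(item, set()).add(tid)

results = {}
eclat(set(), sorted(covers.items()), 2, results)
print(results[frozenset({"mp3-player", "usb-charger"})])  # 3
```

With a minimum support of 2, the sketch finds {mp3-player, usb-charger} with support 3, matching Equation (2), and {book-dct, book-ths} with support 3.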