Apriori Algorithm
Overview
The Apriori algorithm was proposed by Agrawal and Srikant in
1994. Apriori is designed to operate on databases containing
transactions (for example, collections of items bought by
customers, or details of visits to a website or IP
addresses[2]). Other algorithms are designed for finding
association rules in data having no transactions (Winepi and
Minepi), or having no timestamps (DNA sequencing). Each
transaction is seen as a set of items (an itemset). Given a
threshold C, the Apriori algorithm identifies the itemsets which
are subsets of at least C transactions in the database.
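The criterion can be written compactly. The notation here is an assumption made for illustration and does not appear in the text: T stands for the database of transactions, C for the threshold, and supp(X) for the number of transactions that contain the itemset X.

```latex
\[
  \operatorname{supp}(X) = \bigl|\{\, t \in T : X \subseteq t \,\}\bigr|,
  \qquad
  X \text{ is frequent} \iff \operatorname{supp}(X) \ge C .
\]
```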
Examples
Example 1 …
Consider the following database, where each row is a transaction
and each cell is an individual item of the transaction:
Example 2 …
{1,2,3,4}
{1,2,4}
{1,2}
{2,3,4}
{2,3}
{3,4}
{2,4}
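The first step is to count the support of each individual item, that
is, the number of transactions in which it appears. An itemset is
considered frequent if its support is at least 3:
Item Support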
{1} 3
{2} 6
{3} 4
{4} 5
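The single-item counts can be reproduced with a few lines of Python. This is a minimal sketch: the names `transactions`, `item_supports` and `min_support` are chosen here for illustration, and the threshold of 3 is the minimum support used in this example.

```python
from collections import Counter

# Transactions from Example 2, each modelled as a set of items.
transactions = [
    {1, 2, 3, 4},
    {1, 2, 4},
    {1, 2},
    {2, 3, 4},
    {2, 3},
    {3, 4},
    {2, 4},
]


def item_supports(transactions):
    """Count, for every item, the number of transactions containing it."""
    counts = Counter()
    for transaction in transactions:
        # A transaction is a set, so each item is counted once per transaction.
        counts.update(transaction)
    return dict(counts)


min_support = 3  # minimum support used in the example
supports = item_supports(transactions)
frequent_items = sorted(i for i, n in supports.items() if n >= min_support)

print(supports)
print(frequent_items)
```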
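All four items have a support of at least 3, so they are all frequent.
The next step is to count the support of every pair of frequent items.
For example, regarding the pair {1,2}: the first table of Example 2
shows items 1 and 2 appearing together in three of the transactions;
therefore, we say the itemset {1,2} has a support of three.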
Item Support
{1,2} 3
{1,3} 1
{1,4} 2
{2,3} 3
{2,4} 4
{3,4} 3
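The pair counts can be checked in the same way. The sketch below builds every pair of frequent items as a candidate and counts the transactions containing it; `count_support` is a hypothetical helper written for this illustration.

```python
from itertools import combinations

# Transactions from Example 2.
transactions = [
    {1, 2, 3, 4},
    {1, 2, 4},
    {1, 2},
    {2, 3, 4},
    {2, 3},
    {3, 4},
    {2, 4},
]
min_support = 3
frequent_items = [1, 2, 3, 4]  # every single item met the threshold


def count_support(candidates, transactions):
    """Return {candidate: number of transactions that contain it}."""
    return {
        c: sum(1 for t in transactions if set(c) <= t)
        for c in candidates
    }


# Candidate pairs are all 2-combinations of the frequent single items.
candidate_pairs = list(combinations(frequent_items, 2))
pair_supports = count_support(candidate_pairs, transactions)
frequent_pairs = [p for p, n in pair_supports.items() if n >= min_support]

print(pair_supports)
print(frequent_pairs)
```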
The pairs {1,2}, {2,3}, {2,4}, and {3,4} all meet or exceed the
minimum support of 3, so they are frequent. The pairs {1,3} and
{1,4} are not. Now, because {1,3} and {1,4} are not frequent, any
larger set which contains {1,3} or {1,4} cannot be frequent. In this
way, we can prune sets: we will now look for frequent triples in the
database, but we can already exclude all the triples that contain
one of these two pairs:
Item Support
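{2,3,4} 2
In this example there are no frequent triples: {2,3,4} falls below the
minimum support of 3, and every other triple was excluded because it
contains a pair already known to be infrequent.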
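The whole level-wise procedure, including the pruning of candidates that contain an infrequent subset, might be sketched as follows. This is an illustrative implementation, not the exact procedure of the original paper: the join step simply takes unions of two frequent (k-1)-itemsets, and the function name `apriori` and its parameters are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        single = frozenset([item])
        n = support(single)
        if n >= min_support:
            current[single] = n

    frequent = dict(current)
    k = 2
    while current:
        previous = set(current)
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # Prune: drop a candidate if any of its (k-1)-subsets is not frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        # Count only the surviving candidates against the database.
        current = {}
        for c in candidates:
            n = support(c)
            if n >= min_support:
                current[c] = n
        frequent.update(current)
        k += 1
    return frequent


transactions = [
    {1, 2, 3, 4}, {1, 2, 4}, {1, 2}, {2, 3, 4}, {2, 3}, {3, 4}, {2, 4},
]
result = apriori(transactions, min_support=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

On the database of Example 2, the triples {1,2,3} and {1,2,4} are discarded in the prune step without being counted, and {2,3,4} is counted but rejected, so the result holds only the four frequent items and the four frequent pairs.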
Limitations
Apriori, while historically significant, suffers from a number of
inefficiencies or trade-offs, which have spawned other algorithms.
Candidate generation produces large numbers of subsets (the
algorithm attempts to load the candidate set with as many
subsets as possible before each scan of the database). Bottom-up
subset exploration (essentially a breadth-first traversal of the
subset lattice) finds any maximal subset S only after all
$2^{|S|} - 1$ of its proper subsets.
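To give a sense of scale: a set S has 2^|S| - 1 proper subsets when the empty set is counted, so the number of smaller sets visited before S grows exponentially with its size. The short sketch below enumerates them for a four-item set and compares the count with the formula; `proper_subsets` is a hypothetical helper written for this illustration.

```python
from itertools import combinations


def proper_subsets(s):
    """Yield every proper subset of s, including the empty set."""
    items = sorted(s)
    # Sizes 0 .. |S|-1, so that S itself is excluded.
    for size in range(len(items)):
        for combo in combinations(items, size):
            yield frozenset(combo)


S = {1, 2, 3, 4}
count = sum(1 for _ in proper_subsets(S))
print(count, 2 ** len(S) - 1)

# For a 20-item maximal set the same formula gives over a million subsets.
print(2 ** 20 - 1)
```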
The algorithm scans the database many times, which reduces
overall performance. Because of this, the algorithm assumes that
the database is held permanently in memory.
Also, both the time and space complexity of this algorithm are
very high: $O(2^{|D|})$, thus exponential, where $|D|$ is the horizontal
width (the total number of items) present in the database.
References
1. Rakesh Agrawal and Ramakrishnan Srikant. "Fast algorithms for
mining association rules". Proceedings of the 20th
International Conference on Very Large Data Bases, VLDB,
pages 487-499, Santiago, Chile, September 1994.
2. "The data science behind IP address matching". Published by
deductive.com, September 6, 2018, retrieved September 7,
2018.
3. Bayardo Jr, Roberto J. (1998). "Efficiently mining long patterns
from databases" (PDF). ACM SIGMOD Record. 27 (2).
External links
ARtool, GPL Java association rule mining application with GUI,
offering implementations of multiple algorithms for discovery
of frequent patterns and extraction of association rules
(includes Apriori)
SPMF offers Java open-source implementations of Apriori and
several variations such as AprioriClose, UApriori, AprioriInverse,
AprioriRare, MSApriori, AprioriTID, and other more efficient
algorithms such as FPGrowth and LCM.
Christian Borgelt provides C implementations for Apriori and
many other frequent pattern mining algorithms (Eclat,
FPGrowth, etc.). The code is distributed as free software under
the MIT license.
The R package arules contains Apriori and Eclat and
infrastructure for representing, manipulating and analyzing
transaction data and patterns.
Efficient-Apriori is a Python package with an implementation of
the algorithm as presented in the original paper.