Apriori
Frequent Itemset Mining
Can be applied to any tabular dataset after binarization, e.g.:

  A    B    C
  0.5  0.6  10
  0.3  0.7  12
  0.9  0.4  9
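As an illustration, such a numeric table could be binarized in Python by thresholding each column; the thresholds below (0.5 for A and B, 10 for C) are assumptions of this sketch, not part of the material:

  # Minimal binarization sketch: numeric columns become boolean "items"
  # by thresholding. The thresholds are illustrative assumptions.
  rows = [
      {"A": 0.5, "B": 0.6, "C": 10},
      {"A": 0.3, "B": 0.7, "C": 12},
      {"A": 0.9, "B": 0.4, "C": 9},
  ]
  thresholds = {"A": 0.5, "B": 0.5, "C": 10}

  # Each row becomes the set of items whose value reaches its threshold.
  binarized = [
      {col for col, value in row.items() if value >= thresholds[col]}
      for row in rows
  ]
  print(binarized)  # [{'A', 'B', 'C'}, {'B', 'C'}, {'A'}] (set order may vary)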
Transactional Data
● Itemset mining is defined on data that consists of transactions, where each transaction consists of a set of items
● Running example:
{B}
{E}
{A, C}
{A, E}
{B, C}
{D, E}
{C, D, E}
{A, B, C}
{A, B, E}
{A, B, C, E}
Transactional Data
● Transactional data can be represented as a boolean matrix:
              A  B  C  D  E
{B}           0  1  0  0  0
{E}           0  0  0  0  1
{A, C}        1  0  1  0  0
{A, E}        1  0  0  0  1
{B, C}        0  1  1  0  0
{D, E}        0  0  0  1  1
{C, D, E}     0  0  1  1  1
{A, B, C}     1  1  1  0  0
{A, B, E}     1  1  0  0  1
{A, B, C, E}  1  1  1  0  1
Transactional Data
● More formally, we represent:
  ¨ All possible items using a set $\mathcal{I}$, e.g. $\mathcal{I} = \{A, B, C, D, E\}$
  ¨ All possible transaction identifiers using a set $\mathcal{T}$, e.g.: $\mathcal{T} = \{1, 2, \ldots, 10\}$
● A function that returns a boolean for every transaction and item: $D: \mathcal{T} \times \mathcal{I} \rightarrow \{0, 1\}$
  For example, $D(3, A) = 1$ and $D(3, B) = 0$
● A boolean matrix $D$ indexed with transaction and item ids
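For concreteness, a minimal Python sketch of this representation of the running example (the names transactions and D are choices made for this sketch):

  # The running example: transaction identifiers mapped to their item sets.
  transactions = {
      1: frozenset("B"),    2: frozenset("E"),
      3: frozenset("AC"),   4: frozenset("AE"),
      5: frozenset("BC"),   6: frozenset("DE"),
      7: frozenset("CDE"),  8: frozenset("ABC"),
      9: frozenset("ABE"), 10: frozenset("ABCE"),
  }

  def D(t, i):
      # Boolean-matrix view: 1 iff item i occurs in transaction t.
      return 1 if i in transactions[t] else 0

  print(D(3, "A"), D(3, "B"))  # 1 0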
Cover and Support
● An itemset $I \subseteq \mathcal{I}$ matches or covers a transaction $t$ iff every item of $I$ occurs in $t$; more formally: iff $\forall i \in I: D(t, i) = 1$
● The cover of an itemset in a database is the set of identifiers of transactions it covers in the database: $\mathit{cover}(I, D) = \{t \in \mathcal{T} \mid \forall i \in I: D(t, i) = 1\}$
● The support of an itemset in a database is the size of the cover: $\mathit{support}(I, D) = |\mathit{cover}(I, D)|$
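These definitions translate almost literally into Python; a small sketch on top of the transactions dictionary from above:

  def cover(itemset, transactions):
      # Identifiers of the transactions that contain every item of the itemset.
      return {t for t, items in transactions.items() if set(itemset) <= items}

  def support(itemset, transactions):
      # The support is the size of the cover.
      return len(cover(itemset, transactions))

  print(sorted(cover("AC", transactions)))  # [3, 8, 10]
  print(support("AC", transactions))        # 3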
Cover and Support
● Example for the running database:
   1  {B}
   2  {E}
   3  {A, C}
   4  {A, E}
   5  {B, C}
   6  {D, E}
   7  {C, D, E}
   8  {A, B, C}
   9  {A, B, E}
  10  {A, B, C, E}
  For instance, $\mathit{cover}(\{A, C\}, D) = \{3, 8, 10\}$ and $\mathit{support}(\{A, C\}, D) = 3$
Frequency & Frequent Itemsets
● The frequency (or relative support) is the normalized support: $\mathit{frequency}(I, D) = \mathit{support}(I, D) \,/\, |\mathcal{T}|$
● An itemset is frequent iff its support is at or above a given minimum support threshold $\theta$, i.e. iff $\mathit{support}(I, D) \geq \theta$
● The task of frequent itemset mining is the task of finding all itemsets that are frequent, i.e., to find the set $\{I \subseteq \mathcal{I} \mid \mathit{support}(I, D) \geq \theta\}$
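A brute-force sketch of the mining task, reusing the support function above; this enumerates every subset of $\mathcal{I}$ and is only feasible for tiny item sets, which is exactly what Apriori (below) avoids:

  from itertools import combinations

  def frequency(itemset, transactions):
      # Relative support: support normalized by the number of transactions.
      return support(itemset, transactions) / len(transactions)

  def frequent_itemsets_bruteforce(items, transactions, theta):
      # All itemsets with support >= theta, by checking every non-empty subset.
      return {
          frozenset(c)
          for size in range(1, len(items) + 1)
          for c in combinations(sorted(items), size)
          if support(c, transactions) >= theta
      }

  print(frequency("AC", transactions))  # 0.3
  print(sorted(map(sorted, frequent_itemsets_bruteforce("ABCDE", transactions, 4))))
  # [['A'], ['B'], ['C'], ['E']]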
Frequency & Frequent Itemsets
● Example for the running database:
   1  {B}
   2  {E}
   3  {A, C}
   4  {A, E}
   5  {B, C}
   6  {D, E}
   7  {C, D, E}
   8  {A, B, C}
   9  {A, B, E}
  10  {A, B, C, E}
  $\mathit{support}(\{A, C\}, D) = 3$, $\mathit{frequency}(\{A, C\}, D) = 0.3$
● → if $\theta = 3$, $\{A, C\}$ is a frequent itemset
● → if $\theta = 4$, $\{A, C\}$ is infrequent
Frequency & Frequent Itemsets
● Which of the following itemsets are frequent in the running database?
   1  {B}
   2  {E}
   3  {A, C}
   4  {A, E}
   5  {B, C}
   6  {D, E}
   7  {C, D, E}
   8  {A, B, C}
   9  {A, B, E}
  10  {A, B, C, E}
● {A, C}? (support 3)
● {C, D}? (support 1)
● {A, B, E}? (support 2)
Diagram of All Itemsets
[Diagram: the lattice of all itemsets over $\{A, B, C, D, E\}$, with edges denoting the is-a-direct-subset-of relation]
Complexity of Frequent Itemset Mining
● How many potential frequent itemsets are there?
● Is there a database and a minimum support threshold for which this number of frequent itemsets is obtained?
Early History of Frequent Itemset Mining
● 1993: Frequent itemset mining was defined as a task by Agrawal, Imieliński, and Swami
  ¨ Although it was called “large itemset mining” at the time
● 1994: The first good algorithm for frequent itemset mining was published independently by two teams:
  ¨ Agrawal and Srikant from IBM, calling the algorithm “Apriori” (21,608 citations)
  ¨ Mannila, Toivonen and Verkamo from the University of Helsinki, calling the algorithm “OCD” (1,019 citations)
● 1996: Agrawal, Srikant, Mannila, Toivonen and Verkamo published a joint paper summarizing their algorithm
Apriori
Two key ideas:
● Anti-monotonicity of support
● Level-wise search through the itemset lattice
Apriori: Anti-Monotonicity
● A function $f$ on sets is called anti-monotonic iff: $I \subseteq J \Rightarrow f(I) \geq f(J)$
● Support is anti-monotonic: $I \subseteq J \Rightarrow \mathit{support}(I, D) \geq \mathit{support}(J, D)$
  Proof: if $I \subseteq J$, then any transaction that contains all items of $J$ also contains all items of $I$
  Hence, every $t \in \mathit{cover}(J, D)$ is also in $\mathit{cover}(I, D)$
  Hence, $\mathit{cover}(J, D) \subseteq \mathit{cover}(I, D)$
  Hence, $|\mathit{cover}(J, D)| \leq |\mathit{cover}(I, D)|$
  And finally, $\mathit{support}(J, D) \leq \mathit{support}(I, D)$
Apriori: Anti-Monotonicity
● Any subset of a frequent itemset is also frequent
● Any superset of an infrequent itemset is also infrequent
Apriori: Level-Wise Search
[Diagram: the itemset lattice explored level by level, from Level 0 (the empty itemset) to Level 5 (the full itemset $\{A, B, C, D, E\}$)]
Apriori: Outline
Apriori(D, θ):
  i = 0
  do
    Generate candidates C_i at level i from itemsets in F_{i-1} (if i > 0)
    Determine support of candidates C_i in data D
    F_i = Frequent itemsets in C_i
    i = i + 1
  while F_{i-1} is not empty
● Anti-monotonicity is used to reduce the number of itemsets for which the support is calculated (“candidates”); a runnable sketch follows below
  ¨ Don’t count itemsets which have an infrequent subset
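A compact, runnable Python sketch of this outline, reusing the transactions dictionary from above (all names are choices of this sketch; candidate generation here is the simple extend-by-one-item variant, refined on the following slides):

  def apriori(transactions, items, theta):
      # Level-wise frequent itemset mining over a dict {id: set of items}.
      frequent = []
      level = [frozenset()]  # level 0: the empty itemset is always frequent
      while level:
          prev = set(level)
          # Candidates one level deeper: extend by one item, then keep only
          # those whose one-smaller subsets were all frequent (anti-monotonicity).
          candidates = {fs | {i} for fs in level for i in items if i not in fs}
          candidates = {c for c in candidates if all(c - {i} in prev for i in c)}
          # One pass through the data counts all candidates of this level.
          counts = {c: 0 for c in candidates}
          for t in transactions.values():
              for c in candidates:
                  if c <= t:
                      counts[c] += 1
          level = [c for c, n in counts.items() if n >= theta]
          frequent.extend(level)
      return frequent

  print([sorted(s) for s in apriori(transactions, "ABCDE", theta=4)])
  # the four frequent singletons [['A'], ['B'], ['C'], ['E']], in some order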
Apriori: Level-Wise Search
[Diagram: the level-wise search through the itemset lattice (Levels 0 to 5); itemsets with an infrequent subset are never counted]
Apriori: Counting Candidates
● All candidates at a certain level can be counted using a single pass through the data
● Outline: for each transaction, check for every candidate whether the transaction contains all of the candidate’s items; if so, increase that candidate’s support counter (see the sketch below)
● All frequent itemsets are found using at most $|\mathcal{I}|$ passes through the data (one pass per level)
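The single counting pass in isolation, as a sketch (candidates as frozensets, matching the code above):

  def count_supports(candidates, transactions):
      # Support counters for all candidates of one level, in a single pass.
      counts = {c: 0 for c in candidates}
      for t in transactions.values():      # each transaction is read once
          for c in candidates:
              if c <= t:                   # the candidate covers t
                  counts[c] += 1
      return counts

  print(count_supports({frozenset("AC"), frozenset("CD")}, transactions))
  # {frozenset({'A', 'C'}): 3, frozenset({'C', 'D'}): 1} (dict order may vary)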
Apriori: Generating Candidates
● Naïve process: extend each frequent itemset at the previous level with every possible item, then check all subsets of the result
● Inefficient: would need to generate all subsets, many of which may not be reasonable candidates
Apriori: Generating Candidates
● More efficient approach:
  ¨ Convert all frequent itemsets in a level i into strings by sorting the items in all sets (e.g., alphabetically)
    ● {A,B} → AB
    ● {A,C} → AC
  ¨ Generate initial candidates by combining strings that are identical except for the last symbol
    ● AB and AC → ABC
    ● ABC and ABD → ABCD
    ● ABC and ABE → ABCE
  ¨ Advantage: every candidate generated in this process already has at least two frequent subsets (a sketch of the merge follows below)
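A sketch of this merging step in Python, with sorted tuples playing the role of the strings (the final subset check is the optional extra pruning discussed on the next slides):

  def generate_candidates(frequent_level):
      # Merge frequent k-itemsets that agree on their first k-1 (sorted) items.
      prev = set(frequent_level)
      strings = sorted(tuple(sorted(fs)) for fs in frequent_level)
      candidates = set()
      for idx, a in enumerate(strings):
          for b in strings[idx + 1:]:
              if a[:-1] != b[:-1]:   # identical except for the last symbol?
                  break              # sorted order: later strings cannot match
              candidates.add(frozenset(a) | {b[-1]})
      # Optional extra pruning: drop candidates with an infrequent subset.
      return {c for c in candidates if all(c - {i} in prev for i in c)}

  level2 = [frozenset(s) for s in ("AB", "AC", "AE", "BC", "BE", "CE")]
  print(sorted(map(sorted, generate_candidates(level2))))
  # [['A', 'B', 'C'], ['A', 'B', 'E'], ['A', 'C', 'E'], ['B', 'C', 'E']]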
Apriori: Generating Candidates
[Diagram: the merging process illustrated on the itemset lattice (Levels 0 to 5)]
Apriori: Generating Candidates
● After merging itemsets, we can still check the other subsets to eliminate further candidates
[Diagram: the itemset lattice (Levels 0 to 5), marking the itemsets generated by the merging process and the only itemset that can additionally be pruned by this subset check]
Apriori: Generating Candidates
● Common implementation choices:
  ¨ Do not perform the additional anti-monotonicity pruning, as the number of additionally eliminated candidates is often small
  ¨ Store frequent itemsets and candidates in a trie (also called a prefix tree) to make merging more efficient
[Diagram: a trie over sorted itemsets, with root children A, B, C, D and leaves spelling the two-item sets AB, AC, AE, BC, BE, CE, DE]
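One possible realization of such a trie in Python uses nested dictionaries keyed by item; the node layout (a children map plus a support counter) is an assumption of this sketch:

  def build_trie(itemsets):
      # Prefix tree over sorted itemsets: each node maps items to children
      # and carries a support counter used later during counting.
      root = {"count": 0, "children": {}}
      for itemset in itemsets:
          node = root
          for item in sorted(itemset):  # sorting makes paths canonical
              node = node["children"].setdefault(item, {"count": 0, "children": {}})
      return root

  trie = build_trie([frozenset(s) for s in ("AB", "AC", "AE", "BC", "BE", "CE", "DE")])
  print(sorted(trie["children"]))                   # ['A', 'B', 'C', 'D']
  print(sorted(trie["children"]["A"]["children"]))  # ['B', 'C', 'E']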
Apriori: Generating Candidates
● Merging itemsets = copying leaves, followed by deletion
[Diagram: the trie before and after merging; leaves sharing a parent are combined by copying, then obsolete leaves are deleted]
Apriori: Counting Candidates
● The trie can also be used to make candidate counting more efficient
  ¨ Traverse the trie, check the presence of its items in a transaction
[Diagram: trie traversal for a single transaction; the counter of each leaf reached is increased by 1]
Apriori: Counting Candidates
● The trie can also be used to make candidate counting more efficient:

  DetermineSupport(node v, transaction t):
    if v is a leaf then
      increase support counter for node v
    else
      for all children i of node v do
        if item i in transaction t then
          DetermineSupport(node i, transaction t)
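The same traversal over the nested-dict trie sketched above, as runnable Python:

  def count_in_trie(node, transaction):
      # Add +1 to every leaf whose path (an itemset) is contained in the transaction.
      if not node["children"]:          # leaf: all items on its path matched
          node["count"] += 1
          return
      for item, child in node["children"].items():
          if item in transaction:       # descend only along items present in t
              count_in_trie(child, transaction)

  for t in transactions.values():       # one pass through the data
      count_in_trie(trie, t)
  print(trie["children"]["A"]["children"]["C"]["count"])  # 3 == support({A, C})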
First Programming Exercise
● Create two variations of the Apriori algorithm
● They should differ in their optimisations:
  ¨ E.g., with and without subset pruning
  ¨ E.g., with and without “intelligent” elimination of candidates
● Compare their performance on a number of datasets, for varying thresholds
● Do your optimisations work?
Frequent Itemset Mining Challenge
● A “FIMI” challenge was organized in 2003 and 2004, aiming to develop the most efficient itemset mining algorithm
  http://fimi.ua.ac.be
● This led to results such as these:
  [Plot: runtime comparison of implementations of Apriori submitted to the challenge]