
LINFO2364: Mining Patterns in Data

Frequent Itemset Mining & Apriori

Siegfried Nijssen
Transactional Data

Itemset mining is defined on data that consists of
transactions, where each transaction consists of a set of
items

Frequent Itemset Mining
Can be applied to any tabular dataset after binarization:

    A     B     C
  0.5   0.6    10
  0.3   0.7    12
  0.9   0.4     9

  A=[0-0.5]   A=]0.5-1]   B=[0-0.5]   B=]0.5-1]   C=[0-10]   C=]10-20]
      1           0           0           1          1           0
      1           0           0           1          0           1
      0           1           1           0          1           0
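
A minimal sketch of this binarization step in Python, assuming the interval boundaries are chosen by hand; the column names and thresholds below simply mirror the toy table above and are not part of the original slides:

# Binarize a small numeric table into boolean columns, one per interval.
rows = [
    {"A": 0.5, "B": 0.6, "C": 10},
    {"A": 0.3, "B": 0.7, "C": 12},
    {"A": 0.9, "B": 0.4, "C": 9},
]

# Each derived item is (name, predicate on the numeric value).
derived_items = [
    ("A=[0-0.5]", lambda r: r["A"] <= 0.5),
    ("A=]0.5-1]", lambda r: r["A"] > 0.5),
    ("B=[0-0.5]", lambda r: r["B"] <= 0.5),
    ("B=]0.5-1]", lambda r: r["B"] > 0.5),
    ("C=[0-10]",  lambda r: r["C"] <= 10),
    ("C=]10-20]", lambda r: r["C"] > 10),
]

binarized = [[int(pred(r)) for _, pred in derived_items] for r in rows]
print(binarized)  # [[1, 0, 0, 1, 1, 0], [1, 0, 0, 1, 0, 1], [0, 1, 1, 0, 1, 0]]

In practice the interval boundaries would be chosen per attribute, e.g. by equal-width or equal-frequency binning.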
Frequent Itemset Mining

Examples of data analyzed using FIM in the literature:
- Supermarket products per visitor
- Web pages accessed per visitor
- Binarized gene expression for patients
- Transcription factor binding sites of genes
- Single-nucleotide polymorphisms of patients
Transactional Data

Itemset mining is defined on data that consists of
transactions, where each transaction consists of a set of
items

Running example:
{B}
{E}
{A, C}
{A, E}
{B, C}
{D, E}
{C, D, E}
{A, B, C}
{A, B, E}
{A, B, C, E}
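
For later reference, this running example can be written directly as a small Python structure mapping transaction identifiers to itemsets; the names D and items are just conventions for the snippets that follow, not part of the original slides:

# Running example database: transaction identifier -> set of items.
D = {
    1: frozenset("B"),
    2: frozenset("E"),
    3: frozenset("AC"),
    4: frozenset("AE"),
    5: frozenset("BC"),
    6: frozenset("DE"),
    7: frozenset("CDE"),
    8: frozenset("ABC"),
    9: frozenset("ABE"),
    10: frozenset("ABCE"),
}
items = sorted(set().union(*D.values()))  # ['A', 'B', 'C', 'D', 'E']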

Transactional Data

Transactional data can be represented in a boolean matrix

A B C D E
{B} 0 1 0 0 0
{E} 0 0 0 0 1
{A, C} 1 0 1 0 0
{A, E} 1 0 0 0 1
{B, C} 0 1 1 0 0
{D, E} 0 0 0 1 1
{C, D, E} 0 0 1 1 1
{A, B, C} 1 1 1 0 0
{A, B, E} 1 1 0 0 1
{A, B, C, E} 1 1 1 0 1
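
A sketch of how the same boolean matrix can be derived from the dictionary D defined above (purely illustrative):

# Boolean matrix: one row per transaction, one column per item.
matrix = [[int(i in D[t]) for i in items] for t in sorted(D)]
for t, row in zip(sorted(D), matrix):
    print(t, row)   # e.g. 3 [1, 0, 1, 0, 0] for transaction {A, C}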

Transactional Data

More formally, we represent:
- All possible items using a set 𝓘, e.g. 𝓘 = {A, B, C, D, E}
- All possible transaction identifiers using a set 𝓣, e.g. 𝓣 = {1, 2, ..., 10}
- A transactional database using one of these equivalent formalisms:
  - A function D that returns for every transaction identifier a set of items:
    D : 𝓣 → 2^𝓘, for example D(3) = {A, C}
  - A function B that returns a boolean for every transaction and item:
    B : 𝓣 × 𝓘 → {0, 1}, for example B(3, A) = 1 and B(3, B) = 0
  - A boolean matrix indexed with transaction and item identifiers
Cover and Support

An itemset X matches or covers a transaction t iff all items of X occur in the transaction; more formally, iff X ⊆ D(t)

The cover of an itemset X in a database D is the set of identifiers of the transactions it covers in the database:
cover(X, D) = { t ∈ 𝓣 | X ⊆ D(t) }

The support of an itemset X in a database D is the size of the cover:
support(X, D) = |cover(X, D)|
Cover and Support

Example for the running example database:

1 {B}
2 {E}
3 {A, C}
4 {A, E}
5 {B, C}
6 {D, E}
7 {C, D, E}
8 {A, B, C}
9 {A, B, E}
10 {A, B, C, E}

Frequency & Frequent Itemsets

The frequency (or relative support) is the normalized support:
frequency(X, D) = support(X, D) / |𝓣|

An itemset is frequent iff its support reaches a given minimum support threshold θ, i.e. iff
support(X, D) ≥ θ
or, alternatively, iff its frequency reaches the corresponding relative threshold
frequency(X, D) ≥ θ / |𝓣|

The task of frequent itemset mining is the task of finding all itemsets that are frequent, i.e., to find the set
{ X ⊆ 𝓘 | support(X, D) ≥ θ }
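
The same definitions as small Python helpers on top of support; minsup here denotes the absolute threshold θ, and the function names are illustrative:

def frequency(itemset, D):
    return support(itemset, D) / len(D)

def is_frequent(itemset, D, minsup):
    return support(itemset, D) >= minsup

print(frequency(frozenset("AC"), D))       # 0.3
print(is_frequent(frozenset("AC"), D, 2))  # True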
Frequency & Frequent Itemsets

Example for the running example database:

1 {B}
2 {E}
3 {A, C}
4 {A, E}
5 {B, C}
6 {D, E}
7 {C, D, E}
8 {A, B, C}
9 {A, B, E}
10 {A, B, C, E}


For example, support({A, C}, D) = 3 and frequency({A, C}, D) = 0.3:

→ if θ ≤ 3, {A, C} is a frequent itemset

→ if θ > 3, {A, C} is infrequent
Frequency & Frequent Itemsets
1 {B}
2 {E}
3 {A, C}
4 {A, E}
5 {B, C}
6 {D, E}
7 {C, D, E}
8 {A, B, C}
9 {A, B, E}

10 {A, B, C, E}

What are the support and frequency of the following itemsets, and for which thresholds θ are they frequent?

{A, C}?

{C, D}?

{A, B, E}?
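
These can be checked with the helpers defined earlier; the minimum support of θ = 2 used below is just an illustrative choice, the slide leaves the threshold open:

for s in ["AC", "CD", "ABE"]:
    X = frozenset(s)
    print(sorted(X), support(X, D), frequency(X, D), is_frequent(X, D, 2))
# ['A', 'C'] 3 0.3 True
# ['C', 'D'] 1 0.1 False
# ['A', 'B', 'E'] 2 0.2 True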
Diagram of All Itemsets
(Figure: the lattice of all itemsets over the items {A, B, C, D, E}, with edges representing the is-a-direct-subset-of relation.)
Diagram of All Itemsets

(Figure: the same itemset lattice for the running example database, with the itemsets that are frequent for the chosen minimum support threshold highlighted.)
Complexity of Frequent Itemset Mining

How many potential frequent itemsets are there?

2^n − 1 (all non-empty subsets of 𝓘), where n is the number of items

Is there a database and a minimum support threshold for which this number of frequent itemsets is obtained?
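
A throwaway sanity check of this count for the five items of the running example:

from itertools import combinations

item_names = "ABCDE"
n_itemsets = sum(1 for k in range(1, len(item_names) + 1)
                 for _ in combinations(item_names, k))
print(n_itemsets, 2 ** len(item_names) - 1)   # 31 31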
Early History of Frequent Itemset Mining

1993: Frequent itemset mining was defined as a task by Agrawal, Imieliński, and Swami
- Although it was called "large itemset mining" at the time

1994: The first good algorithm for frequent itemset mining was published independently by two teams:
- Agrawal and Srikant from IBM, calling the algorithm "Apriori" (21,608 citations)
- Mannila, Toivonen and Verkamo from the University of Helsinki, calling the algorithm "OCD" (1,019 citations)

1996: Agrawal, Srikant, Mannila, Toivonen and Verkamo published a joint paper summarizing their algorithms
Apriori
Two key ideas:

1) Anti-monotonicity
2) Level-wise search
Apriori: Anti-Monotonicity

A function f on sets is called anti-monotonic iff:
for all sets X and Y: X ⊆ Y ⟹ f(X) ≥ f(Y)

Support is anti-monotonic:
X ⊆ Y ⟹ support(X, D) ≥ support(Y, D)

Proof: if X ⊆ Y, then for every transaction t with Y ⊆ D(t) we also have X ⊆ D(t)
Hence, every transaction covered by Y is also covered by X
Hence, cover(Y, D) ⊆ cover(X, D)
Hence, |cover(Y, D)| ≤ |cover(X, D)|
And finally, support(Y, D) ≤ support(X, D)
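
This property can be checked numerically on the running example, reusing support and D from earlier snippets; adding items never increases the support:

for s in ["A", "AB", "ABC", "ABCE"]:
    X = frozenset(s)
    print(sorted(X), support(X, D))
# ['A'] 5
# ['A', 'B'] 3
# ['A', 'B', 'C'] 2
# ['A', 'B', 'C', 'E'] 1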
Apriori: Anti-Monotonicity

Any subset of a frequent itemset is also frequent

Any superset of an infrequent itemset is also infrequent

Apriori: Level-Wise Search

(Figure: the itemset lattice organized into levels by itemset size, from Level 0 (the empty itemset) up to Level 5 ({A, B, C, D, E}); the search proceeds one level at a time.)
Apriori: Outline

Apriori(D, θ):
    i = 0
    do
        C_i = Generate candidates at level i from the itemsets in F_{i-1} (if i > 0)
        Determine support of candidates C_i in data D
        F_i = Frequent itemsets in C_i
        i = i + 1
    while F_{i-1} is not empty

Anti-monotonicity is used to reduce the number of itemsets for which the support is calculated ("candidates"):
- Don't count itemsets which have an infrequent subset
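
A compact, runnable Python sketch of this outline, reusing D and the support helper defined in earlier snippets; the candidate generation used here is the prefix-merge described on the later candidate-generation slides, and all names (apriori, minsup, level) are illustrative:

from itertools import combinations

def apriori(D, minsup):
    # Level 1: frequent single items, kept as sorted tuples.
    all_items = sorted(set().union(*D.values()))
    frequent = {}                                   # itemset -> support
    level = []
    for i in all_items:
        s = support(frozenset([i]), D)
        if s >= minsup:
            level.append((i,))
            frequent[frozenset([i])] = s
    while level:
        # Generate next-level candidates by merging itemsets that share a prefix.
        candidates = []
        for a, b in combinations(sorted(level), 2):
            if a[:-1] == b[:-1] and a[-1] < b[-1]:
                candidates.append(a + (b[-1],))
        # One pass over the data: count every candidate in every transaction.
        counts = dict.fromkeys(candidates, 0)
        for t_items in D.values():
            for c in candidates:
                if frozenset(c) <= t_items:
                    counts[c] += 1
        # Keep the frequent candidates as the next level.
        level = sorted(c for c in candidates if counts[c] >= minsup)
        for c in level:
            frequent[frozenset(c)] = counts[c]
    return frequent

print(len(apriori(D, 2)))   # 14 frequent itemsets in the running example for θ = 2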
Apriori: Level-Wise Search

(Figure: the same level-wise search over the lattice; each itemset is marked either as a candidate whose support is counted, or as pruned because it has an infrequent subset.)
Apriori: Counting Candidates

All candidates at a certain level can be counted using a single pass through the data

Outline:

Determine support of candidates C_i in data D:
    for every transaction t in D do
        for every candidate C in C_i do
            if C ⊆ D(t) then
                increase the counter for candidate C

All frequent itemsets are found using at most n + 1 passes through the data (one pass per level), where n is the number of items
Apriori: Generating Candidates

Naïve process:

Generate candidates at level i from the itemsets in F_{i-1}:
    for all subsets C of 𝓘 with |C| = i do
        if all subsets of C of size i − 1 occur in F_{i-1} then
            add C to the set of candidates

Inefficient: would need to generate all subsets of 𝓘, many of which may not be reasonable candidates
Apriori: Generating Candidates

More efficient approach:
- Convert all frequent itemsets at level i into strings by sorting the items in all sets (e.g., alphabetically)
    {A,B} → AB
    {A,C} → AC
- Generate initial candidates by combining strings that are identical except for the last symbol
    AB and AC → ABC
    ABC and ABD → ABCD
    ABC and ABE → ABCE
- Advantage: every candidate generated in this process already has at least two frequent subsets (the two itemsets that were merged); see the sketch below
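
A sketch of this prefix-merge generation on sorted tuples; generate_candidates is an illustrative name, and the optional pruning step is the additional subset check discussed on the following slides:

from itertools import combinations

# level: the frequent itemsets of one size, as sorted tuples, e.g. ("A", "B").
def generate_candidates(level):
    prev = set(level)
    candidates = []
    for a, b in combinations(sorted(level), 2):
        if a[:-1] == b[:-1] and a[-1] < b[-1]:
            c = a + (b[-1],)
            # Optional extra pruning: every subset of size |c|-1 must be frequent.
            if all(sub in prev for sub in combinations(c, len(c) - 1)):
                candidates.append(c)
    return candidates

pairs = [("A","B"), ("A","C"), ("A","E"), ("B","C"), ("B","E"), ("C","E"), ("D","E")]
print(generate_candidates(pairs))
# [('A', 'B', 'C'), ('A', 'B', 'E'), ('A', 'C', 'E'), ('B', 'C', 'E')]
print(generate_candidates([("A","B","C"), ("A","B","E")]))
# [] : ('A','B','C','E') is generated, but pruned because {A,C,E} is not frequent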
Apriori: Generating Candidates

(Figure: the level-wise lattice again, highlighting the itemsets that are generated as candidates by this merging of frequent itemsets.)
Apriori: Generating Candidates

After merging itemsets, we can still check the other subsets to eliminate further candidates

(Figure: the level-wise lattice, showing the itemsets generated by the merging process; in the running example only one itemset can be pruned additionally by this subset check.)
Apriori: Generating Candidates

Common implementation choices:
- Do not perform the additional anti-monotonicity pruning, as the number of candidates eliminated additionally is often small
- Store frequent itemsets and candidates in a trie (also called a prefix tree) to make merging more efficient

Trie for all frequent itemsets of size 2 in the running example:

    root
      A: B, C, E      ({A,B}, {A,C}, {A,E})
      B: C, E         ({B,C}, {B,E})
      C: E            ({C,E})
      D: E            ({D,E})
Apriori: Generating Candidates

Merging itemsets = copying leaves, followed by deletion

Before (trie of the frequent 2-itemsets):

    root
      A: B, C, E
      B: C, E
      C: E
      D: E

After copying (each leaf receives copies of the leaves to its right under the same parent; leaves that receive no children are then deleted):

    root
      A: B: C, E      (candidates {A,B,C}, {A,B,E})
         C: E         (candidate {A,C,E})
      B: C: E         (candidate {B,C,E})
Apriori: Counting Candidates

The trie can also be used to make candidate counting more
efficient
- Traverse the trie, checking the presence of its items in the transaction; increase the counter of each leaf that is reached

(Figure: the candidate trie traversed for one transaction; branches labelled with an item that is absent from the transaction are skipped, and the counter of a reached leaf is increased by +1.)
Apriori: Counting Candidates

The trie can also be used to make candidate counting more efficient

Determine support for node v in the candidate trie in transaction t:
    if v is a leaf then
        increase the support counter for node v
    else
        for all children i of node v do
            if item i is in transaction t then
                Determine support for node i in transaction t

Call this function for the root of the trie, once per transaction
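
A minimal Python sketch of this recursive trie counting, reusing the running-example database D from earlier snippets; the nested-dictionary trie and the function names are illustrative choices:

# A trie node is a dict: item -> child node; a leaf carries a 'count' entry.
def make_trie(candidates):
    root = {}
    for c in candidates:                 # c is a sorted tuple of items
        node = root
        for item in c:
            node = node.setdefault(item, {})
        node["count"] = 0
    return root

def count_transaction(node, t_items):
    if "count" in node:                  # leaf: the candidate is contained in t
        node["count"] += 1
        return
    for item, child in node.items():
        if item in t_items:
            count_transaction(child, t_items)

trie = make_trie([("A", "B", "C"), ("A", "B", "E"), ("A", "C", "E"), ("B", "C", "E")])
for t_items in D.values():
    count_transaction(trie, t_items)
print(trie["A"]["B"]["C"]["count"])      # 2 : {A,B,C} occurs in transactions 8 and 10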
First Programming Exercise

Create two variations of the Apriori algorithm


They should differ in optimisations:
- E.g. with and without subset pruning
- E.g. with and without "intelligent" elimination of candidates

Compare their performance on a number of datasets, for varying thresholds

Do your optimisations work?
Frequent Itemset Mining Challenge

A "FIMI" challenge was organized in 2003 and 2004, aiming to develop the most efficient itemset mining algorithm

http://fimi.ua.ac.be

This led to results such as these: (Figure: runtime comparison of several Apriori implementations from the FIMI challenge.)