Session 8-Association Rules Mining
Contents
1 Introduction
2 Mining algorithms
3 Different data formats for mining
4 Problems with the association mining
5 Mining class association rules (CAR)
1. Introduction
❖ Data Mining
▪ Data processing
▪ Data warehouses and OLAP
▪ Association Rules Mining
▪ Classification
▪ Clustering
▪ Sequential Patterns Mining
▪ Advanced topics: outlier detection, web mining
❖ Proposed by Agrawal et al. in 1993. It is an important
data mining model studied extensively by the database
and data mining community
The model: data
❖ I = {i1, i2, …, im}: the set of all items
❖ Transaction t: a set of items such that t ⊆ I
Example of Transaction database: supermarket data
❖ Concepts:
▪ An item: an item/article in a basket
▪ I: the set of all items sold in the store
▪ A transaction: items purchased in a basket; it may
have TID (transaction ID)
▪ A transactional dataset: A set of transactions
Example of transaction database: a set of documents
The model: rules
❖ An association rule is an implication of the form X → Y, where X, Y ⊂ I and X ∩ Y = ∅
❖ A transaction t contains X, a set of items (itemset) in I, if X ⊆ t
Rule Measures: Support and Confidence
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and those who buy both]
❖ Find all the rules X & Y ⇒ Z with minimum confidence and support
▪ support, s: probability that a transaction contains {X, Y, Z}
▪ confidence, c: conditional probability that a transaction having {X, Y} also contains Z
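❖ To make the two measures concrete, here is a minimal Python sketch (not from the slides; the toy baskets and helper names are illustrative assumptions):

    def support(transactions, itemset):
        # Fraction of transactions that contain every item in itemset
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(transactions, lhs, rhs):
        # Conditional probability that a transaction with lhs also has rhs
        return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

    # Hypothetical baskets, for illustration only
    baskets = [
        {"beer", "diaper", "milk"},
        {"beer", "diaper"},
        {"beer", "bread"},
        {"diaper", "bread"},
    ]
    print(support(baskets, {"beer", "diaper"}))       # 0.5
    print(confidence(baskets, {"beer"}, {"diaper"}))  # 2/3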
Goal and key features
An example
❖ Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
❖ Assume: minsup = 30%, minconf = 80%
Transaction data representation
❖ A simplistic view of shopping baskets
Association Rule Mining: A Road Map
❖ Boolean vs. quantitative associations (based on the types of values handled)
▪ buys(x, “SQLServer”) ∧ buys(x, “DMBook”) → buys(x, “DBMiner”) [0.2%, 60%]
▪ age(x, “30..39”) ∧ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]
❖ Various extensions
▪ Correlation, causality analysis
• Association does not necessarily imply correlation or causality
▪ Constraints enforced
• E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
2. Mining algorithms
❖ They use different strategies and data structures
❖ Naïve algorithm: enumerate every possible rule, scan the data to compute its support and confidence, and keep the rules that meet minsup and minconf; this is exponential in the number of items and hence impractical
Discovering Rules (2)
Mining Frequent Itemsets: the Key Step
❖ Step 1: find the frequent itemsets, i.e., the itemsets whose support ≥ minsup
❖ Step 2: use the frequent itemsets to generate rules (see below)
The Apriori algorithm
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
[Figure: candidate itemset lattice — singletons A, B, C, D and pairs AB, AC, AD, BC, BD, CD]
The Algorithm
❖ Iterative algorithm (also called level-wise search): find all 1-item frequent itemsets, then all 2-item frequent itemsets, and so on
▪ In each iteration k, only consider itemsets that contain some frequent (k-1)-itemset
❖ From k = 2
▪ Ck = candidates of size k: those itemsets of size k that
could be frequent, given Fk-1
▪ Fk = those itemsets that are actually frequent, Fk ⊆ Ck
(need to scan the database once).
Example – Finding frequent itemsets
Details: ordering of items
❖ The items in I are sorted in lexicographic order (which is
a total order).
Details: the algorithm
Algorithm Apriori(T)
  C1 ← init-pass(T);
  F1 ← {f | f ∈ C1, f.count/n ≥ minsup};   // n: no. of transactions in T
  for (k = 2; Fk-1 ≠ ∅; k++) do
    Ck ← candidate-gen(Fk-1);
    for each transaction t ∈ T do
      for each candidate c ∈ Ck do
        if c is contained in t then
          c.count++;
      end
    end
    Fk ← {c ∈ Ck | c.count/n ≥ minsup}
  end
return F ← ∪k Fk;
Apriori candidate generation
❖ The candidate-gen function takes Fk-1 and returns a
superset (called the candidates) of the set of all
frequent k-itemsets. It has two steps:
▪ join step: Generate all possible candidate itemsets
Ck of length k
▪ prune step: Remove those candidates in Ck that
cannot be frequent
Candidate-gen function
Function candidate-gen(Fk-1)
  Ck ← ∅;
  for all f1, f2 ∈ Fk-1
      with f1 = {i1, …, ik-2, ik-1}
      and f2 = {i1, …, ik-2, i'k-1}
      and ik-1 < i'k-1 do
    c ← {i1, …, ik-1, i'k-1};   // join f1 and f2
    Ck ← Ck ∪ {c};
    for each (k-1)-subset s of c do
      if (s ∉ Fk-1) then
        delete c from Ck;   // prune
    end
  end
  return Ck;
An example
❖ F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4},
{1, 3, 5}, {2, 3, 4}}
❖ After join
▪ C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}
❖ After pruning:
▪ C4 = {{1, 2, 3, 4}}
because {1, 4, 5} is not in F3 ({1, 3, 4, 5} is removed)
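❖ The following is a minimal runnable Python sketch of the two functions above (Apriori(T) and candidate-gen), applied to the 7-transaction example from the earlier slide; itemsets are represented as sorted tuples, and all names are my own, not from the slides:

    from itertools import combinations

    def candidate_gen(fk):
        # Join step + prune step over the frequent k-itemsets fk (sorted tuples)
        fk_set = set(fk)
        candidates = []
        for f1, f2 in combinations(sorted(fk), 2):
            if f1[:-1] == f2[:-1] and f1[-1] < f2[-1]:   # join f1 and f2
                c = f1 + (f2[-1],)
                # prune: every (k-1)-subset of c must itself be frequent
                if all(s in fk_set for s in combinations(c, len(c) - 1)):
                    candidates.append(c)
        return candidates

    def apriori(transactions, minsup):
        # Level-wise search: count each candidate's support with a scan,
        # keep those meeting minsup, then generate the next level
        n = len(transactions)
        items = sorted({i for t in transactions for i in t})
        current = [(i,) for i in items]
        frequent = []
        while current:
            counts = {c: sum(set(c) <= t for t in transactions) for c in current}
            fk = [c for c in current if counts[c] / n >= minsup]
            frequent.extend(fk)
            current = candidate_gen(fk)
        return frequent

    data = [
        {"Beef", "Chicken", "Milk"}, {"Beef", "Cheese"}, {"Cheese", "Boots"},
        {"Beef", "Chicken", "Cheese"},
        {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
        {"Chicken", "Clothes", "Milk"}, {"Chicken", "Milk", "Clothes"},
    ]
    for itemset in apriori(data, minsup=0.3):
        print(itemset)   # e.g. ('Chicken', 'Clothes', 'Milk') is frequent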
The Apriori Algorithm — Example
❖ Min support = 50% = 2 transactions
[Figure: repeated scans of database D — C1 is counted and filtered to L1, L1 is joined into C2, a scan of D gives L2, and a final scan over C3 gives L3]
Step 2: Generating rules from frequent itemsets
❖ For each frequent itemset X and each proper nonempty subset A of X:
▪ let B = X − A; A → B is an association rule if confidence(A → B) ≥ minconf
▪ support(A → B) = support(X); confidence(A → B) = support(X) / support(A)
Generating rules: an example
❖ Suppose {2,3,4} is frequent, with sup=50%
▪ Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4},
with sup=50%, 50%, 75%, 75%, 75%, 75% respectively
▪ These subsets generate the following association rules:
• 2,3 → 4, confidence=100%
• 2,4 → 3, confidence=100%
• 3,4 → 2, confidence=67%
• 2 → 3,4, confidence=67%
• 3 → 2,4, confidence=67%
• 4 → 2,3, confidence=67%
• All rules have support = 50%
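❖ A sketch of this rule-generation step in Python (the support table is taken from the example above; helper names are mine):

    from itertools import combinations

    def generate_rules(itemset, sup, minconf):
        # Emit every rule A -> B where A is a proper nonempty subset of itemset
        # and confidence(A -> B) = sup(itemset) / sup(A) >= minconf
        items = frozenset(itemset)
        rules = []
        for r in range(1, len(items)):
            for lhs in map(frozenset, combinations(sorted(items), r)):
                conf = sup[items] / sup[lhs]
                if conf >= minconf:
                    rules.append((set(lhs), set(items - lhs), conf))
        return rules

    # Supports from the slide, as fractions
    sup = {frozenset(s): v for s, v in [
        ((2, 3, 4), 0.50), ((2, 3), 0.50), ((2, 4), 0.50), ((3, 4), 0.75),
        ((2,), 0.75), ((3,), 0.75), ((4,), 0.75),
    ]}
    for lhs, rhs, conf in generate_rules((2, 3, 4), sup, minconf=0.8):
        print(lhs, "->", rhs, conf)   # prints only the two 100%-confidence rules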
Generating rules: summary
On the Apriori Algorithm
❖ Seems to be very expensive
❖ Level-wise search
❖ K = the size of the largest itemset
❖ It makes at most K passes over the data
❖ In practice, K is bounded (often around 10)
❖ The algorithm is very fast; under some conditions, all rules can be found in linear time
❖ Scales up to large data sets
Hash-tree: search
❖ If you are at an interior node and you just used item i, then
use each item that comes after i in T
Methods to Improve Apriori’s Efficiency
❖ Transaction reduction: A transaction that does not contain any
frequent k-itemset is useless in subsequent scans
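❖ A one-function Python sketch of the transaction-reduction idea (the function name is mine):

    def reduce_transactions(transactions, frequent_k):
        # A transaction with no frequent k-itemset cannot contain any frequent
        # (k+1)-itemset, so it can be dropped from subsequent scans
        return [t for t in transactions
                if any(set(f) <= t for f in frequent_k)]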
Is Apriori Fast Enough? — Performance Bottlenecks
Max-Miner
Max-Miner: the idea
[Figure: set-enumeration tree over the items 1, 2, 3, 4]
Max-Miner pruning
The algorithm
Max-Miner(T)
  C ← ∅   // the set of candidate groups
  F ← {Gen-Initial-Groups(T, C)}
  while C ≠ ∅ do
    scan T to count the support of all candidate groups in C
    for each g ∈ C s.t. h(g) ∪ t(g) is frequent do
      F ← F ∪ {h(g) ∪ t(g)}
    Cnew ← ∅
    for each g ∈ C s.t. h(g) ∪ t(g) is infrequent do
      F ← F ∪ {Gen-Sub-Nodes(g, Cnew)}
    C ← Cnew
    remove from F any itemset with a proper superset in F
    remove from C any group g s.t. h(g) ∪ t(g) has a superset in F
  return F
The algorithm (2)
Gen-Initial-Groups(T, C)
  scan T to obtain F1, the set of frequent 1-itemsets
  impose an ordering on the items in F1
  for each item i in F1 other than the greatest item do
    let g be a new candidate group with h(g) = {i}
        and t(g) = {j | j follows i in the ordering}
    C ← C ∪ {g}
  return the itemset F1 (and the C, of course)
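❖ Max-Miner's target output is the set of maximal frequent itemsets. As a point of reference only — this is not Max-Miner itself, which avoids enumerating all frequent itemsets — here is a naïve Python filter over the output of the earlier apriori() sketch:

    def maximal_frequent(frequent_itemsets):
        # Keep only itemsets with no proper frequent superset (naive O(n^2) scan)
        sets = [set(f) for f in frequent_itemsets]
        return [f for f in sets if not any(f < g for g in sets)]

    # e.g. maximal_frequent(apriori(data, minsup=0.3))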
Item Ordering
More on association rule mining
❖ Clearly the space of all association rules is exponential, O(2^m), where m is the number of items in I
3. Different data formats for mining
❖ The data can be in transaction form or table form
❖ Transaction form:
a, b
a, c, d, e
a, d, f
Conversion
❖ Table form:
Attr1, Attr2, Attr3
a, b, d
b, c, e
❖ ⇒ Transaction form:
(Attr1, a), (Attr2, b), (Attr3, d)
(Attr1, b), (Attr2, c), (Attr3, e)
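❖ A minimal Python sketch of this conversion, using the slide's toy table (the function name is mine):

    def table_to_transactions(columns, rows):
        # Each row becomes a transaction of (attribute, value) items
        return [{(col, val) for col, val in zip(columns, row)} for row in rows]

    cols = ["Attr1", "Attr2", "Attr3"]
    rows = [("a", "b", "d"), ("b", "c", "e")]
    for t in table_to_transactions(cols, rows):
        print(sorted(t))   # [('Attr1', 'a'), ('Attr2', 'b'), ('Attr3', 'd')], ...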
4. Problems with the association mining
❖ Single minsup: It assumes that all items in the data are
of the same nature and/or have similar frequencies
Rare Item Problem
❖ If the frequencies of items vary a great deal, we will
encounter two problems:
▪ If minsup is set too high, those rules that involve rare
items will not be found
▪ To find rules that involve both frequent and rare items,
minsup has to be set very low. This may cause
combinatorial explosion because those frequent items
will be associated with one another in all possible ways
Multiple minsups model
❖ Each item can have its own minimum item support (its MIS value)
Minsup of a rule
❖ Let MIS(i) be the MIS value of item i. The minsup of a rule
R is the lowest MIS value of the items in the rule
An Example
❖ Consider the following items:
bread, shoes, clothes
Solution
❖ We sort all items in I according to their MIS values (make
it a total order)
The MSapriori algorithm
Algorithm MSapriori(T, MS)
  M ← sort(I, MS);
  L ← init-pass(M, T);
  F1 ← {{i} | i ∈ L, i.count/n ≥ MIS(i)};
  for (k = 2; Fk-1 ≠ ∅; k++) do
    if k = 2 then
      Ck ← level2-candidate-gen(L)
    else Ck ← MScandidate-gen(Fk-1);
    end;
    for each transaction t ∈ T do
      for each candidate c ∈ Ck do
        if c is contained in t then
          c.count++;
        if c − {c[1]} is contained in t then
          c.tailCount++
      end
    end
    Fk ← {c ∈ Ck | c.count/n ≥ MIS(c[1])}
  end
return F ← ∪k Fk;
Candidate itemset generation
❖ Special treatments needed:
▪ Sorting the items according to their MIS values
▪ First pass over data (the first three lines)
• Let us look at this in detail
▪ Candidate generation at level-2
• Read it in the handout
▪ Pruning step in level-k (k > 2) candidate generation
• Read it in the handout
First pass over data
❖ It makes a pass over the data to record the support
count of each item
❖ Assume our data set has 100 transactions. The first pass gives us the following support counts:
{1}.count = 9, {2}.count = 25,
{3}.count = 6, {4}.count = 3
❖ Why?
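❖ The slides do not show init-pass in full, so the Python sketch below follows the standard MSapriori description: items are scanned in increasing MIS order, the first item that meets its own MIS value enters the seed list L and fixes the threshold that every later item must clear. It reuses the counts above; the MIS values are hypothetical, chosen only for illustration:

    def init_pass(counts, mis, n):
        # Scan items in increasing MIS order; the first item i with
        # counts[i]/n >= MIS(i) enters L and anchors the threshold MIS(i);
        # each later item j enters L if counts[j]/n >= MIS(i)
        L, anchor = [], None
        for item in sorted(counts, key=lambda i: mis[i]):
            if anchor is None:
                if counts[item] / n >= mis[item]:
                    anchor = mis[item]
                    L.append(item)
            elif counts[item] / n >= anchor:
                L.append(item)
        return L

    counts = {1: 9, 2: 25, 3: 6, 4: 3}           # from the slide
    mis = {1: 0.10, 2: 0.20, 3: 0.05, 4: 0.06}   # hypothetical MIS values
    L = init_pass(counts, mis, n=100)
    F1 = [i for i in L if counts[i] / 100 >= mis[i]]
    print(L, F1)   # [3, 1, 2] [3, 2] with these MIS values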
On multiple minsup rule mining
❖ Multiple minsup model subsumes the single support
model
5. Mining class association rules (CAR)
❖ Normal association rule mining does not have any target
❖ It finds all possible rules that exist in data, i.e., any item
can appear as a consequent or a condition of a rule
Problem definition
❖ Let T be a transaction data set consisting of n transactions, where each transaction is labeled with a class y ∈ Y
❖ A class association rule (CAR) is an implication of the form X → y, where X ⊆ I and y ∈ Y
An example
❖ A text document data set
doc 1: Student, Teach, School : Education
doc 2: Student, School : Education
doc 3: Teach, School, City, Game : Education
doc 4: Baseball, Basketball : Sport
doc 5: Basketball, Player, Spectator : Sport
doc 6: Baseball, Coach, Game, Team : Sport
doc 7: Basketball, Team, City, Game : Sport
❖ Let minsup = 20% and minconf = 60%. The following are two examples of class association rules:
▪ Student, School → Education [sup = 2/7, conf = 2/2]
▪ Game → Sport [sup = 2/7, conf = 2/3]
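❖ A small Python check of these two rules against the documents above (the function name is mine):

    docs = [
        ({"Student", "Teach", "School"}, "Education"),
        ({"Student", "School"}, "Education"),
        ({"Teach", "School", "City", "Game"}, "Education"),
        ({"Baseball", "Basketball"}, "Sport"),
        ({"Basketball", "Player", "Spectator"}, "Sport"),
        ({"Baseball", "Coach", "Game", "Team"}, "Sport"),
        ({"Basketball", "Team", "City", "Game"}, "Sport"),
    ]

    def car_measures(docs, itemset, cls):
        # sup = P(itemset and cls); conf = P(cls | itemset)
        body = sum(itemset <= items for items, _ in docs)
        both = sum(itemset <= items and label == cls for items, label in docs)
        return both / len(docs), both / body

    print(car_measures(docs, {"Student", "School"}, "Education"))  # (2/7, 1.0)
    print(car_measures(docs, {"Game"}, "Sport"))                   # (2/7, 2/3)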
Mining algorithm
❖ Unlike normal association rules, CARs can be mined directly in one step
❖ Each class can have its own minimum support. For example, with a data set with two classes, Yes and No, we may want
▪ rules of class Yes to have a minimum support of 5%, and
▪ rules of class No to have a minimum support of 10%
❖ By setting minimum class supports to 100% (or more for some classes), we tell the algorithm not to generate rules of those classes
Summary
❖ Association rule mining has been extensively studied in the
data mining community
Exercises
Exercises (2)
Main reference