
Session 8: Association Rules Mining

Lecturer: Dr. Le Hoang Son


Vietnam National University (VNU)
[email protected]
[email protected]

1
Contents
1 Introduction

2 Mining algorithms

3 Data formats for mining

4 Multiple minimum supports

5 Mining class association rules

6 Discussion & Exercises

2
1. Introduction
❖ Data Mining
▪ Data processing
▪ Data warehouses and OLAP
▪ Association Rules Mining
▪ Classification
▪ Clustering
▪ Sequential Patterns Mining
▪ Advanced topics: outlier detection, web mining

3
1. Introduction
❖ Proposed by Agrawal et al. in 1993. It is an important
data mining model studied extensively by the database
and data mining community

❖ Assumes all data are categorical; there is no good algorithm for numeric data

❖ Initially used for Market Basket Analysis to find how items purchased by customers are related

Bread → Milk [sup = 5%, conf = 100%]

4
The model: data

❖ I = {i1, i2, …, im}: a set of items

❖ Transaction t:
▪ t is a set of items, and t ⊆ I

❖ Transaction Database T: a set of transactions T = {t1, t2, …, tn}

5
Example of Transaction database: supermarket data

❖ Market basket transactions:


t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}

❖ Concepts:
▪ An item: an item/article in a basket
▪ I: the set of all items sold in the store
▪ A transaction: items purchased in a basket; it may
have TID (transaction ID)
▪ A transactional dataset: A set of transactions

6
Example of transaction database: a set of documents

❖ A text document data set. Each document is treated as a “bag” of keywords

doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game

7
The model: rules
❖ A transaction t contains X, a set of items
(itemset) in I, if X ⊆ t

❖ An association rule is an implication of the form:
X → Y, where X, Y ⊂ I, and X ∩ Y = ∅

❖ An itemset is a set of items
▪ E.g., X = {milk, bread, cereal} is an itemset

❖ A k-itemset is an itemset with k items
▪ E.g., {milk, bread, cereal} is a 3-itemset
8
Rule strength measures
❖ Support: The rule holds with support sup in T
(the transaction data set) if sup% of transactions
contain X ∪ Y
▪ sup = Pr(X ∪ Y)

❖ Confidence: The rule holds in T with confidence conf if conf% of transactions that contain X also contain Y
▪ conf = Pr(Y | X)

❖ An association rule is a pattern that states that when X occurs, Y occurs with a certain probability
9
Support and Confidence
❖ Support count: The support count of an itemset
X, denoted by X.count, in a data set T is the
number of transactions in T that contain X.
Assume T has n transactions.
❖ Then,
sup(X → Y) = (X ∪ Y).count / n
conf(X → Y) = (X ∪ Y).count / X.count
10
Rule Measures: Support and Confidence
[Figure: Venn diagram of customers buying beer, diapers, or both]

Find all the rules X & Y ⇒ Z with minimum confidence and support
▪ support, s, probability that a transaction contains {X ∪ Y ∪ Z}
▪ confidence, c, conditional probability that a transaction having {X ∪ Y} also contains Z

Let minimum support 50% and minimum confidence 50%; we have
▪ A ⇒ C (50%, 66.6%)
▪ C ⇒ A (50%, 100%)
11
Goal and key features

❖ Goal: Find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf)
❖ Key Features
▪ Completeness: find all rules.
▪ No target item(s) on the right-hand-side
▪ Mining with data on hard disk (not in memory)

12
An example
❖ Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes

❖ Assume: minsup = 30%, minconf = 80%

❖ An example frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7]

❖ Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
… …
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
13
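
A small Python sketch (not from the original slides) of how support and confidence can be computed for this example; the helper names support and confidence are illustrative:

# Minimal support/confidence computation for the seven baskets above
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def support(itemset, T):
    """Fraction of transactions in T that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in T if itemset <= t) / len(T)

def confidence(lhs, rhs, T):
    """Pr(rhs | lhs) = support(lhs ∪ rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), T) / support(lhs, T)

print(support({"Chicken", "Clothes", "Milk"}, transactions))        # 3/7
print(confidence({"Clothes"}, {"Milk", "Chicken"}, transactions))   # 3/3 = 1.0
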
Application Examples
❖ Market Basket Analysis
▪ * ⇒ Maintenance Agreement (What the store should do to
boost Maintenance Agreement sales?)

▪ Home Electronics ⇒ * (What other products should the store stock up on if it has a sale on Home Electronics?)

▪ Attached mailing in direct marketing

▪ Detecting “ping-pong”ing of patients


• Transaction: patient
• Item: doctor/clinic visited by patient
• Support of the rule: number of common patients

14
Transaction data representation
❖ A simplistic view of shopping baskets

❖ Some important information not considered:


▪ the quantity of each item purchased and
▪ the price paid

15
Association Rule Mining: A Road Map
❖ Boolean vs. quantitative associations (Based on the types of values
handled)
▪ buys(x, “SQLServer”) ^ buys(x, “DMBook”) → buys(x, “DBMiner”) [0.2%, 60%]
▪ age(x, “30..39”) ^ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]

❖ Single dimension vs. multiple dimensional associations (see the examples above)

❖ Single level vs. multiple-level analysis


▪ What brands of beers are associated with what brands of diapers?

❖ Various extensions
▪ Correlation, causality analysis
• Association does not necessarily imply correlation or causality
▪ Constraints enforced
• E.g., small sales (sum < 100) trigger big buys (sum > 1,000)?
16
Contents
1 Introduction

2 Mining algorithms

3 Data formats for mining

4 Multiple minimum supports

5 Mining class association rules

6 Discussion & Exercises

17
2. Mining algorithms
❖ They use different strategies and data structures

❖ Their resulting sets of rules are all the same


▪ Given a transaction data set T, a minimum support and a minimum confidence, the set of association rules existing in T is uniquely determined

❖ Any algorithm should find the same set of rules, although their computational efficiencies and memory requirements may be different.

❖ The Apriori Algorithm


18
Discovering Rules

❖ Naïve Algorithm:

for each frequent itemset l do
  for each subset c of l do
    if (support(l) / support(l - c) >= minconf) then
      output the rule (l - c) ⇒ c,
        with confidence = support(l) / support(l - c)
        and support = support(l)

19
Discovering Rules (2)

❖ Lemma. If consequent c generates a valid rule, so do all subsets of c (e.g., if X ⇒ YZ holds, then XY ⇒ Z and XZ ⇒ Y hold)

❖ Example: Consider a frequent itemset ABCDE

If ACDE ⇒ B and ABCE ⇒ D are the only one-consequent rules with minimum support and confidence, then ACE ⇒ BD is the only other rule that needs to be tested

20
Mining Frequent Itemsets: the Key Step

❖ Find the frequent itemsets: the sets of items that have minimum support
▪ A subset of a frequent itemset must also be a frequent itemset
• i.e., if {AB} is a frequent itemset, both {A} and {B} must also be frequent itemsets
▪ Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)

❖ Use the frequent itemsets to generate association rules

21
The Apriori algorithm
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes

▪ Find all itemsets that have minimum support (frequent itemsets, also called large itemsets)
▪ Use frequent itemsets to generate rules
❖ E.g., a frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
22
Step 1: Mining all frequent itemsets
❖ A frequent itemset is an itemset whose support is ≥
minsup

❖ Key idea: the apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset

[Figure: lattice of itemsets over {A, B, C, D} — 1-itemsets A, B, C, D; 2-itemsets AB, AC, AD, BC, BD, CD; 3-itemsets ABC, ABD, ACD, BCD]

23
The Algorithm
❖ Iterative algo. (also called level-wise search): Find all
1-item frequent itemsets; then all 2-item frequent
itemsets, and so on
▪ In each iteration k, only consider itemsets that contain some frequent (k-1)-itemset.

❖ Find frequent itemsets of size 1: F1

❖ From k = 2
▪ Ck = candidates of size k: those itemsets of size k that
could be frequent, given Fk-1
▪ Fk = those itemsets that are actually frequent, Fk ⊆ Ck
(need to scan the database once).
24
Example – Finding frequent itemsets

Dataset T (minsup = 0.5):
TID    Items
T100   1, 3, 4
T200   2, 3, 5
T300   1, 2, 3, 5
T400   2, 5

itemset : count
1. scan T 🡺 C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
           🡺 F1: {1}:2, {2}:3, {3}:3, {5}:3
           🡺 C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T 🡺 C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
           🡺 F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
           🡺 C3: {2, 3, 5}
3. scan T 🡺 C3: {2, 3, 5}:2 🡺 F3: {2, 3, 5}

25
Details: ordering of items
❖ The items in I are sorted in lexicographic order (which is
a total order).

❖ The order is used throughout the algorithm in each itemset.

❖ {w[1], w[2], …, w[k]} represents a k-itemset w consisting of items w[1], w[2], …, w[k], where w[1] < w[2] < … < w[k] according to the total order.

26
Details: the algorithm
Algorithm Apriori(T)
  C1 ← init-pass(T);
  F1 ← {f | f ∈ C1, f.count/n ≥ minsup};   // n: no. of transactions in T
  for (k = 2; Fk-1 ≠ ∅; k++) do
    Ck ← candidate-gen(Fk-1);
    for each transaction t ∈ T do
      for each candidate c ∈ Ck do
        if c is contained in t then
          c.count++;
      end
    end
    Fk ← {c ∈ Ck | c.count/n ≥ minsup}
  end
  return F ← ∪k Fk;

27
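
The level-wise loop above can also be written as a short, runnable Python sketch (an illustration under the slide's definitions, not the lecturer's reference implementation); candidate generation is folded in as a simple join-and-prune over Fk-1:

from itertools import combinations

def apriori_frequent_itemsets(T, minsup):
    """Return {frozenset: count} for all frequent itemsets (a sketch)."""
    n = len(T)
    counts = {}
    for t in T:                                  # first pass: 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Fk = {s: c for s, c in counts.items() if c / n >= minsup}
    F = dict(Fk)
    k = 2
    while Fk:
        prev = list(Fk)
        Ck = set()
        for i in range(len(prev)):               # join step
            for j in range(i + 1, len(prev)):
                c = prev[i] | prev[j]
                if len(c) == k and all(frozenset(s) in Fk
                                       for s in combinations(c, k - 1)):
                    Ck.add(c)                    # survived the prune step
        counts = {c: sum(1 for t in T if c <= t) for c in Ck}   # one scan of T
        Fk = {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}
        F.update(Fk)
        k += 1
    return F

# Dataset of slide 25 (minsup = 0.5); reproduces F3 = {2, 3, 5}
T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori_frequent_itemsets(T, 0.5))
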
Apriori candidate generation
❖ The candidate-gen function takes Fk-1 and returns a
superset (called the candidates) of the set of all
frequent k-itemsets. It has two steps:
▪ join step: Generate all possible candidate itemsets
Ck of length k
▪ prune step: Remove those candidates in Ck that
cannot be frequent

28
Candidate-gen function
Function candidate-gen(Fk-1)
  Ck ← ∅;
  for all f1, f2 ∈ Fk-1
      with f1 = {i1, …, ik-2, ik-1}
      and f2 = {i1, …, ik-2, i’k-1}
      and ik-1 < i’k-1 do
    c ← {i1, …, ik-1, i’k-1};          // join f1 and f2
    Ck ← Ck ∪ {c};
    for each (k-1)-subset s of c do
      if (s ∉ Fk-1) then
        delete c from Ck;              // prune
    end
  end
  return Ck;

29
An example
❖ F3 = {{1, 2, 3}, {1, 2, 4}, {1, 3, 4},
{1, 3, 5}, {2, 3, 4}}

❖ After join
▪ C4 = {{1, 2, 3, 4}, {1, 3, 4, 5}}

❖ After pruning:
▪ C4 = {{1, 2, 3, 4}}
because {1, 4, 5} is not in F3 ({1, 3, 4, 5} is removed)

30
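
A Python sketch of the join and prune steps (the name candidate_gen and the tuple-based representation are illustrative); it reproduces the F3 → C4 example above:

from itertools import combinations

def candidate_gen(F_prev, k):
    """Join frequent (k-1)-itemsets that share their first k-2 items, then
    prune candidates with an infrequent (k-1)-subset (a sketch)."""
    F_prev = {frozenset(f) for f in F_prev}
    sorted_prev = [tuple(sorted(f)) for f in F_prev]
    Ck = set()
    for f1 in sorted_prev:
        for f2 in sorted_prev:
            # join step: identical except for the last (largest) item
            if f1[:-1] == f2[:-1] and f1[-1] < f2[-1]:
                c = frozenset(f1) | {f2[-1]}
                # prune step: every (k-1)-subset must be in F_prev
                if all(frozenset(s) in F_prev for s in combinations(c, k - 1)):
                    Ck.add(c)
    return Ck

F3 = [{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}]
print(candidate_gen(F3, 4))    # {frozenset({1, 2, 3, 4})}
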
The Apriori Algorithm — Example
Min support = 50% (= 2 transactions)

[Figure: worked example on database D — scan D to get C1 and L1, join and scan again for C2 and L2, then C3 and L3]
31
Step 2: Generating rules from frequent itemsets

❖ Frequent itemsets ≠ association rules

❖ One more step is needed to generate association rules

❖ For each frequent itemset X,
  for each proper nonempty subset A of X,
  ▪ let B = X - A
  ▪ A → B is an association rule if
    • confidence(A → B) ≥ minconf, where
      support(A → B) = support(A ∪ B) = support(X)
      confidence(A → B) = support(A ∪ B) / support(A)

32
Generating rules: an example
❖ Suppose {2,3,4} is frequent, with sup=50%
▪ Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4},
with sup=50%, 50%, 75%, 75%, 75%, 75% respectively
▪ These generate these association rules:
• 2,3 → 4, confidence=100%
• 2,4 → 3, confidence=100%
• 3,4 → 2, confidence=67%
• 2 → 3,4, confidence=67%
• 3 → 2,4, confidence=67%
• 4 → 2,3, confidence=67%
• All rules have support = 50%

33
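
A Python sketch of this step (an illustration; the supports are passed in as a dictionary, as they would have been recorded during itemset generation), reproducing the {2, 3, 4} example:

from itertools import combinations

def rules_from_itemset(X, sup, minconf):
    """Generate rules A -> B from frequent itemset X using recorded supports.
    sup maps frozensets to their support; the names are illustrative (a sketch)."""
    X = frozenset(X)
    rules = []
    for r in range(1, len(X)):                       # proper nonempty subsets A
        for A in map(frozenset, combinations(X, r)):
            conf = sup[X] / sup[A]
            if conf >= minconf:
                rules.append((set(A), set(X - A), sup[X], conf))
    return rules

# Supports from the example: {2, 3, 4} with sup = 50%
sup = {
    frozenset({2, 3, 4}): 0.50,
    frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
    frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
}
for A, B, s, c in rules_from_itemset({2, 3, 4}, sup, minconf=0.6):
    print(f"{A} -> {B}  [sup = {s:.0%}, conf = {c:.0%}]")
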
Generating rules: summary

❖ To recap, in order to obtain A → B, we need to have support(A ∪ B) and support(A)

❖ All the information required to compute confidence has already been recorded during itemset generation; there is no need to scan the data T again

❖ This step is not as time-consuming as frequent itemset generation

34
On Apriori Algorithm
Seems to be very expensive
❖ Level-wise search
❖ K = the size of the largest itemset
❖ It makes at most K passes over data
❖ In practice, K is bounded (often around 10)
❖ The algorithm is very fast. Under some conditions, all rules can be found in linear time
❖ Scales up to large data sets

35
Hash-tree: search

❖ Given a transaction t and the candidate set Ck, find all members of Ck that are contained in t

❖ Assume an ordering on the items

❖ Start from the root and use every item in t to go to the next node

❖ If you are at an interior node and you just used item i, then use each item that comes after i in t

❖ If you are at a leaf node, check the itemsets

36
Methods to Improve Apriori’s Efficiency
❖ Transaction reduction: A transaction that does not contain any frequent k-itemset is useless in subsequent scans

❖ Partitioning: Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB

❖ Sampling: mining on a subset of the given data, with a lower support threshold plus a method to determine the completeness

❖ Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent

37
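
As one illustration, transaction reduction is only a few lines of Python (a sketch assuming the frequent k-itemsets are available as frozensets):

def reduce_transactions(T, Fk):
    """Keep only transactions containing at least one frequent k-itemset;
    the rest cannot contribute to any larger frequent itemset (a sketch)."""
    return [t for t in T if any(f <= t for f in Fk)]

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
F2 = {frozenset(s) for s in ({1, 3}, {2, 3}, {2, 5}, {3, 5})}
print(reduce_transactions(T, F2))    # here every transaction still qualifies
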
Is Apriori Fast Enough? — Performance Bottlenecks

❖ The core of the Apriori algorithm:


▪ Use frequent (k – 1)-itemsets to generate candidate frequent
k-itemsets
▪ Use database scan and pattern matching to collect counts for the
candidate itemsets

❖ The bottleneck of Apriori: candidate generation
▪ Huge candidate sets:
  • 10^4 frequent 1-itemsets will generate 10^7 candidate 2-itemsets
  • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates
▪ Multiple scans of the database:
  • Needs (n + 1) scans, where n is the length of the longest pattern

38
Max-Miner

❖ Max-Miner finds long patterns efficiently: the maximal frequent patterns

❖ Instead of checking all subsets of a long pattern, it tries to detect long patterns early

❖ Scales linearly with the size of the patterns

39
Max-Miner: the idea

[Figure: set enumeration tree of the ordered set {1, 2, 3, 4} — root φ; level 1: 1, 2, 3, 4; level 2: 1,2  1,3  1,4  2,3  2,4  3,4; level 3: 1,2,3  1,2,4  1,3,4  2,3,4; level 4: 1,2,3,4]

Pruning: (1) set infrequency, (2) superset frequency

Each node is a candidate group g
▪ h(g) is the head: the itemset of the node
▪ t(g) is the tail: an ordered set that contains all items that can appear in the sub-nodes

Example: h({1}) = {1} and t({1}) = {2,3,4}

40
Max-miner pruning

❖ When we count the support of a candidate group g, we also compute the support of h(g), h(g) ∪ t(g), and h(g) ∪ {i} for each i in t(g)

❖ If h(g) ∪ t(g) is frequent, then stop expanding the node g and report the union as a frequent itemset

❖ If h(g) ∪ {i} is infrequent, then remove i from all sub-nodes (just remove i from any tail of a group after g)

❖ Expand the node g by one and do the same

41
The algorithm
Max-Miner(T)
  Set of candidate groups C ← {}
  Set of itemsets F ← {Gen-Initial-Groups(T, C)}
  while C is not empty do
    scan T to count the support of all candidate groups in C
    for each g in C s.t. h(g) ∪ t(g) is frequent do
      F ← F ∪ {h(g) ∪ t(g)}
    Set of candidate groups Cnew ← {}
    for each g in C s.t. h(g) ∪ t(g) is infrequent do
      F ← F ∪ {Gen-Sub-Nodes(g, Cnew)}
    C ← Cnew
    remove from F any itemset with a proper superset in F
    remove from C any group g s.t. h(g) ∪ t(g) has a superset in F
  return F

42
The algorithm (2)
Gen-Initial-Groups(T, C)
  scan T to obtain F1, the set of frequent 1-itemsets
  impose an ordering on the items in F1
  for each item i in F1 other than the greatest item do
    let g be a new candidate group with h(g) = {i}
      and t(g) = {j | j follows i in the ordering}
    C ← C ∪ {g}
  return the itemset F1 (and the updated C, of course)

Gen-Sub-Nodes(g, C)   /* generate new candidate groups at the next level */
  remove any item i from t(g) if h(g) ∪ {i} is infrequent
  reorder the items in t(g)
  for each i in t(g) other than the greatest do
    let g’ be a new candidate group with h(g’) = h(g) ∪ {i}
      and t(g’) = {j | j ∈ t(g) and j is after i in t(g)}
    C ← C ∪ {g’}
  return h(g) ∪ {m}, where m is the greatest item in t(g), or h(g) if t(g) is empty

43
Item Ordering

❖ By re-ordering the items we try to increase the effectiveness of frequency pruning

❖ Very frequent items have a higher probability of being contained in long patterns

❖ Put these items at the end of the ordering, so they appear in many tails

44
More on association rule mining
❖ Clearly the space of all association rules is exponential, O(2^m), where m is the number of items in I

❖ The mining exploits sparseness of data, and high minimum support and high minimum confidence values

❖ Still, it always produces a huge number of rules: thousands, tens of thousands, millions, ...

45
Contents
1 Introduction

2 Mining algorithms

3 Data formats for mining

4 Multiple minimum supports

5 Mining class association rules

6 Discussion & Exercises

46
3. Different data formats for mining
❖ The data can be in transaction form or table form

Transaction form: a, b
a, c, d, e
a, d, f

Table form:
  Attr1  Attr2  Attr3
  a      b      d
  b      c      e

❖ Table data need to be converted to transaction form for association mining

47
Conversion
Table form: Attr1 Attr2 Attr3
a, b, d
b, c, e

⇒ Transaction form:
(Attr1, a), (Attr2, b), (Attr3, d)
(Attr1, b), (Attr2, c), (Attr3, e)

candidate-gen can be slightly improved. Why?

48
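
A possible Python sketch of the conversion (the (attribute, value) pair encoding is one common convention, not prescribed by the slides):

def table_to_transactions(rows, attr_names):
    """Turn each table row into a transaction of (attribute, value) items so
    that equal values under different attributes remain distinct (a sketch)."""
    return [{(a, v) for a, v in zip(attr_names, row)} for row in rows]

rows = [("a", "b", "d"), ("b", "c", "e")]
print(table_to_transactions(rows, ("Attr1", "Attr2", "Attr3")))
# [{('Attr1', 'a'), ('Attr2', 'b'), ('Attr3', 'd')},
#  {('Attr1', 'b'), ('Attr2', 'c'), ('Attr3', 'e')}]   (set order may vary)
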
Contents
1 Introduction

2 Mining algorithms

3 Data formats for mining

4 Multiple minimum supports

5 Mining class association rules

6 Discussion & Exercises

49
4. Problems with the association mining
❖ Single minsup: It assumes that all items in the data are
of the same nature and/or have similar frequencies

❖ Not true: In many applications, some items appear very frequently in the data, while others rarely appear

❖ E.g., in a supermarket, people buy food processors and cooking pans much less frequently than they buy bread and milk

50
Rare Item Problem
❖ If the frequencies of items vary a great deal, we will
encounter two problems:
▪ If minsup is set too high, those rules that involve rare
items will not be found
▪ To find rules that involve both frequent and rare items,
minsup has to be set very low. This may cause
combinatorial explosion because those frequent items
will be associated with one another in all possible ways

51
Multiple minsups model

❖ The minimum support of a rule is expressed in terms of minimum item supports (MIS) of the items that appear in the rule

❖ Each item can have a minimum item support

❖ By providing different MIS values for different items, the user effectively expresses different support requirements for different rules

52
Minsup of a rule
❖ Let MIS(i) be the MIS value of item i. The minsup of a rule
R is the lowest MIS value of the items in the rule

❖ I.e., a rule R: a1, a2, …, ak → ak+1, …, ar satisfies its minimum support if its actual support is ≥ min(MIS(a1), MIS(a2), …, MIS(ar)).

53
An Example
❖ Consider the following items:
bread, shoes, clothes

The user-specified MIS values are as follows:
MIS(bread) = 2%   MIS(shoes) = 0.1%   MIS(clothes) = 0.2%

The following rule doesn’t satisfy its minsup:
clothes → bread [sup = 0.15%, conf = 70%]

The following rule satisfies its minsup:
clothes → shoes [sup = 0.15%, conf = 70%]
54
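
A short Python check of the two rules above (MIS values and supports taken from the slide; the helper name satisfies_minsup is illustrative):

MIS = {"bread": 0.02, "shoes": 0.001, "clothes": 0.002}

def satisfies_minsup(rule_items, actual_sup, MIS):
    """A rule's minsup is the lowest MIS value among its items (a sketch)."""
    return actual_sup >= min(MIS[i] for i in rule_items)

print(satisfies_minsup({"clothes", "bread"}, 0.0015, MIS))   # False: needs 0.2%
print(satisfies_minsup({"clothes", "shoes"}, 0.0015, MIS))   # True: needs only 0.1%
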
Downward closure property
❖ In the new model, the downward closure property no longer holds

❖ E.g., consider four items 1, 2, 3 and 4 in a database. Their minimum item supports are
MIS(1) = 10%   MIS(2) = 20%
MIS(3) = 5%    MIS(4) = 6%

❖ {1, 2} with support 9% is infrequent (it must meet min(10%, 20%) = 10%), but {1, 2, 3} and {1, 2, 4} could still be frequent, since they only need to meet 5% and 6% respectively

55
Solution
❖ We sort all items in I according to their MIS values (make
it a total order)

❖ The order is used throughout the algorithm in each itemset

❖ Each itemset w is of the following form:
{w[1], w[2], …, w[k]}, consisting of items w[1], w[2], …, w[k], where MIS(w[1]) ≤ MIS(w[2]) ≤ … ≤ MIS(w[k]).

56
The MSapriori algorithm
Algorithm MSapriori(T, MS)
  M ← sort(I, MS);
  L ← init-pass(M, T);
  F1 ← {{i} | i ∈ L, i.count/n ≥ MIS(i)};
  for (k = 2; Fk-1 ≠ ∅; k++) do
    if k = 2 then
      Ck ← level2-candidate-gen(L)
    else
      Ck ← MScandidate-gen(Fk-1);
    end;
    for each transaction t ∈ T do
      for each candidate c ∈ Ck do
        if c is contained in t then
          c.count++;
        if c – {c[1]} is contained in t then
          c.tailCount++
      end
    end
    Fk ← {c ∈ Ck | c.count/n ≥ MIS(c[1])}
  end
  return F ← ∪k Fk;

57
Candidate itemset generation
❖ Special treatments needed:
▪ Sorting the items according to their MIS values
▪ First pass over data (the first three lines)
• Let us look at this in detail
▪ Candidate generation at level-2
• Read it in the handout
▪ Pruning step in level-k (k > 2) candidate generation
• Read it in the handout

58
First pass over data
❖ It makes a pass over the data to record the support
count of each item

❖ It then follows the sorted order to find the first item i in M that meets MIS(i)
▪ i is inserted into L.
▪ For each subsequent item j in M after i, if j.count/n ≥
MIS(i) then j is also inserted into L, where j.count is
the support count of j and n is the total number of
transactions in T. Why?

❖ L is used by function level2-candidate-gen


59
First pass over data: an example
❖ Consider the four items 1, 2, 3 and 4 in a data set. Their
minimum item supports are:
MIS(1) = 10% MIS(2) = 20%
MIS(3) = 5% MIS(4) = 6%

❖ Assume our data set has 100 transactions. The first pass
gives us the following support counts:
{3}.count = 6, {4}.count = 3,
{1}.count = 9, {2}.count = 25

❖ Then L = {3, 1, 2}, and F1 = {{3}, {2}}


❖ Item 4 is not in L because 4.count/n < MIS(3) (= 5%),
❖ {1} is not in F1 because 1.count/n < MIS(1) (= 10%)
60
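
A Python sketch of this first pass (the function name init_pass is illustrative); it reproduces L = [3, 1, 2] and F1 = {{3}, {2}} from the example:

def init_pass(MIS, counts, n):
    """First pass of MSapriori: build the seed list L and F1 (a sketch).
    MIS maps item -> minimum item support, counts maps item -> support count."""
    M = sorted(MIS, key=MIS.get)                 # items sorted by MIS value
    L = []
    for item in M:
        if not L:
            # first item that meets its own MIS
            if counts[item] / n >= MIS[item]:
                L.append(item)
        else:
            # subsequent items only need to meet the MIS of the first item in L
            if counts[item] / n >= MIS[L[0]]:
                L.append(item)
    F1 = [{i} for i in L if counts[i] / n >= MIS[i]]
    return L, F1

MIS = {1: 0.10, 2: 0.20, 3: 0.05, 4: 0.06}
counts = {1: 9, 2: 25, 3: 6, 4: 3}
print(init_pass(MIS, counts, n=100))   # ([3, 1, 2], [{3}, {2}])
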
Rule generation
❖ The following two lines in MSapriori algorithm are
important for rule generation, which are not needed for
the Apriori algorithm:

if c – {c[1]} is contained in t then
  c.tailCount++

❖ Many rules cannot be generated without them

❖ Why?

61
On multiple minsup rule mining
❖ Multiple minsup model subsumes the single support
model

❖ It is a more realistic model for practical applications

❖ The model enables us to find rare-item rules without producing a huge number of meaningless rules involving frequent items

❖ By setting the MIS values of some items to 100% (or more), we effectively instruct the algorithms not to generate rules involving only those items
62
Contents
1 Introduction

2 Mining algorithms

3 Data formats for mining

4 Multiple minimum supports

5 Mining class association rules

6 Discussion & Exercises

63
5. Mining class association rules (CAR)
❖ Normal association rule mining does not have any target

❖ It finds all possible rules that exist in data, i.e., any item
can appear as a consequent or a condition of a rule

❖ However, in some applications, the user is interested in some targets:
▪ E.g., the user has a set of text documents from some known topics. He/she wants to find out what words are associated or correlated with each topic.

64
Problem definition
❖ Let T be a transaction data set consisting of n transactions

❖ Each transaction is also labeled with a class y

❖ Let I be the set of all items in T, Y be the set of all class labels, and I ∩ Y = ∅

❖ A class association rule (CAR) is an implication of the form
X → y, where X ⊆ I, and y ∈ Y

❖ The definitions of support and confidence are the same as those for normal association rules

65
An example
❖ A text document data set
doc 1: Student, Teach, School : Education
doc 2: Student, School : Education
doc 3: Teach, School, City, Game : Education
doc 4: Baseball, Basketball : Sport
doc 5: Basketball, Player, Spectator : Sport
doc 6: Baseball, Coach, Game, Team : Sport
doc 7: Basketball, Team, City, Game : Sport

❖ Let minsup = 20% and minconf = 60%. The following are two
examples of class association rules:

Student, School → Education [sup = 2/7, conf = 2/2]
Game → Sport [sup = 2/7, conf = 2/3]

66
Mining algorithm
❖ Unlike normal association rules, CARs can be mined
directly in one step

❖ The key operation is to find all ruleitems that have support above minsup. A ruleitem is of the form:
(condset, y)
where condset is a set of items from I (i.e., condset ⊆ I), and y ∈ Y is a class label.

❖ Each ruleitem basically represents a rule:
condset → y
❖ The Apriori algorithm can be modified to generate CARs
67
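
A Python sketch of the one-step counting, restricted to ruleitems whose condset has a single item (just to illustrate the (condset, y) idea; the full algorithm extends condsets level-wise as in Apriori, and the names here are not from the slides):

from collections import Counter

# the labeled document data set of the previous slide
docs = [
    ({"Student", "Teach", "School"}, "Education"),
    ({"Student", "School"}, "Education"),
    ({"Teach", "School", "City", "Game"}, "Education"),
    ({"Baseball", "Basketball"}, "Sport"),
    ({"Basketball", "Player", "Spectator"}, "Sport"),
    ({"Baseball", "Coach", "Game", "Team"}, "Sport"),
    ({"Basketball", "Team", "City", "Game"}, "Sport"),
]

def one_item_cars(data, minsup, minconf):
    """Count ruleitems (condset, y) with |condset| = 1 and keep those that
    satisfy minsup and minconf (a simplified sketch of one-step CAR mining)."""
    n = len(data)
    cond_counts, rule_counts = Counter(), Counter()
    for items, y in data:
        for i in items:
            cond_counts[i] += 1
            rule_counts[(i, y)] += 1
    cars = []
    for (i, y), cnt in rule_counts.items():
        sup, conf = cnt / n, cnt / cond_counts[i]
        if sup >= minsup and conf >= minconf:
            cars.append((i, y, sup, conf))
    return cars

for i, y, s, c in one_item_cars(docs, minsup=0.20, minconf=0.60):
    print(f"{i} -> {y}  [sup = {s:.2f}, conf = {c:.2f}]")
# e.g. Game -> Sport with sup = 2/7 and conf = 2/3 appears in the output
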
Multiple minimum class supports
❖ The multiple minimum support idea can also be applied here

❖ The user can specify different minimum supports for different classes, which effectively assigns a different minimum support to the rules of each class

❖ For example, we have a data set with two classes, Yes and
No. We may want
▪ rules of class Yes to have the minimum support of 5% and
▪ rules of class No to have the minimum support of 10%

❖ By setting minimum class supports to 100% (or more for some
classes), we tell the algorithm not to generate rules of those
classes

68
Contents
1 Introduction

2 Mining algorithms

3 Data formats for mining

4 Multiple minimum supports

5 Mining class association rules

6 Discussion & Exercises

69
Summary
❖ Association rule mining has been extensively studied in the
data mining community

❖ There are many efficient algorithms and model variations

❖ Other related work includes


▪ Multi-level or generalized rule mining
▪ Constrained rule mining
▪ Incremental rule mining
▪ Maximal frequent itemset mining
▪ Numeric association rule mining
▪ Rule interestingness and visualization
▪ Parallel algorithms
▪ …
70
Questions

71
Exercises

72
Exercises (2)

73
Main reference

74

75
