Module1 Part2

Mining Association Rules

Association rule mining


 Proposed by Agrawal et al. in 1993.
 It is an important data mining model studied
extensively by the database and data mining
community.
 Assume all data are categorical.
 Initially used for Market Basket Analysis to find
how items purchased by customers are related.

Bread → Milk [sup = 5%, conf = 100%]

2
The model: data

 I = {i1, i2, …, im}: a set of items.


 Transaction t:
 t is a set of items, and t ⊆ I.

 Transaction Database T: a set of transactions


T = {t1, t2, …, tn}.

3
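A minimal sketch (in Python, not part of the original slides) of how this model maps onto code; the item names are made up for illustration:

    # Items are plain strings; a transaction is a set of items;
    # the database T is a list of transactions.
    I = {"bread", "cheese", "milk", "apple", "eggs"}  # the set of all items

    T = [
        {"bread", "cheese", "milk"},  # t1
        {"apple", "eggs", "milk"},    # t2
    ]

    assert all(t <= I for t in T)  # every transaction satisfies t ⊆ I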
Transaction data: supermarket data
 Market basket transactions:
t1: {bread, cheese, milk}
t2: {apple, eggs, salt, yogurt}
… …
tn: {biscuit, eggs, milk}
 Concepts:
 An item: an item/article in a basket
 I: the set of all items sold in the store
 A transaction: the items purchased in a basket; it may
have a TID (transaction ID)
 A transactional dataset: A set of transactions
4
The model: rules
 A transaction t contains X, a set of items
(itemset) in I, if X ⊆ t.
 An association rule is an implication of the
form:
X → Y, where X, Y ⊂ I, and X ∩ Y = ∅

 An itemset is a set of items.


 E.g., X = {milk, bread, cereal} is an itemset.
 A k-itemset is an itemset with k items.
 E.g., {milk, bread, cereal} is a 3-itemset

5
Rule strength measures
 Support: The rule holds with support sup in T
(the transaction data set) if sup% of
transactions contain X ∪ Y.
 sup = Pr(X ∪ Y).
 Confidence: The rule holds in T with
confidence conf if conf% of transactions that
contain X also contain Y.
 conf = Pr(Y | X)
 An association rule is a pattern that states that
when X occurs, Y occurs with a certain
probability.
6
Support and Confidence
 Support count: The support count of an
itemset X, denoted by X.count, in a data set
T is the number of transactions in T that
contain X. Assume T has n transactions.
 Then,
support = (X ∪ Y).count / n
confidence = (X ∪ Y).count / X.count
7
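These two definitions translate directly to code. A hedged Python sketch (not from the slides; transactions are assumed to be Python sets, as in the model sketch earlier):

    def support_count(itemset, T):
        # number of transactions in T that contain itemset
        return sum(1 for t in T if itemset <= t)

    def support(X, Y, T):
        # support of X -> Y: fraction of transactions containing X ∪ Y
        return support_count(X | Y, T) / len(T)

    def confidence(X, Y, T):
        # confidence of X -> Y: Pr(Y | X)
        return support_count(X | Y, T) / support_count(X, T)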
Goal and key features
 Goal: Find all rules that satisfy the user-
specified minimum support (minsup) and
minimum confidence (minconf).
 Key Features
 Completeness: find all rules.
 No target item(s) on the right-hand side
 Mining with data on hard disk (not in memory)

8
An example

 Transaction data:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
 Assume:
minsup = 30%
minconf = 80%
 An example frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7]
 Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
… …
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]

9
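These numbers can be checked with the support/confidence helpers sketched after the Support and Confidence slide (an assumption of this note, not slide material):

    T = [
        {"Beef", "Chicken", "Milk"},
        {"Beef", "Cheese"},
        {"Cheese", "Boots"},
        {"Beef", "Chicken", "Cheese"},
        {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
        {"Chicken", "Clothes", "Milk"},
        {"Chicken", "Milk", "Clothes"},
    ]
    print(support({"Clothes"}, {"Milk", "Chicken"}, T))     # 3/7 ≈ 0.43
    print(confidence({"Clothes"}, {"Milk", "Chicken"}, T))  # 3/3 = 1.0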
Transaction data representation
 A simplistic view of shopping baskets.
 Some important information is not considered,
e.g.:
 the quantity of each item purchased and
 the price paid.

10
Many mining algorithms
 There are a large number of them!!
 They use different strategies and data structures.
 Their resulting sets of rules are all the same.
 Given a transaction data set T, a minimum support, and
a minimum confidence, the set of association rules existing in
T is uniquely determined.
 Any algorithm should find the same set of rules
although their computational efficiencies and
memory requirements may be different.
 We study only one: the Apriori Algorithm

11
The Apriori algorithm
 Probably the best-known algorithm
 Two steps:
 Find all itemsets that have minimum support
(frequent itemsets, also called large itemsets).
 Use frequent itemsets to generate rules.

 E.g., a frequent itemset


{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]

12
Step 1: Mining all frequent
itemsets
 A frequent itemset is an itemset whose support
is ≥ minsup.
 Key idea: the apriori property (downward
closure property): every subset of a frequent
itemset is also a frequent itemset.
[Itemset lattice over {A, B, C, D}:
 ABC  ABD  ACD  BCD
 AB  AC  AD  BC  BD  CD
 A  B  C  D
e.g., if ABC is frequent, then AB, AC, BC, A, B, and C must all be frequent.]

13
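One way to exploit downward closure when pruning candidates — a Python sketch (itemsets assumed to be frozensets; not part of the original slides):

    from itertools import combinations

    def all_subsets_frequent(candidate, frequent_prev):
        # a k-itemset can only be frequent if every (k-1)-subset is frequent
        k = len(candidate)
        return all(frozenset(s) in frequent_prev
                   for s in combinations(candidate, k - 1))

    # e.g. if AB is not frequent, ABC is pruned without scanning the data:
    F2 = {frozenset("AC"), frozenset("BC")}            # AB assumed infrequent
    print(all_subsets_frequent(frozenset("ABC"), F2))  # False -> prune ABC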
The Algorithm
 Iterative algorithm (also called level-wise search):
find all 1-item frequent itemsets; then all 2-item
frequent itemsets, and so on.
 In each iteration k, only consider itemsets that
contain some frequent (k−1)-itemset.


 Find frequent itemsets of size 1: F1
 From k = 2:
 Ck = candidates of size k: those itemsets of size k
that could be frequent, given Fk-1
 Fk = those candidates that are actually frequent,
Fk ⊆ Ck (needs one scan of the database; see the sketch below).
14
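A compact Python sketch of this level-wise search — an illustration of the idea under the assumptions above, not Agrawal et al.'s optimized implementation:

    from itertools import combinations

    def apriori(T, minsup):
        # returns {frequent itemset (frozenset): support count}
        n = len(T)
        counts = {}
        for t in T:                       # one scan of T for F1
            for item in t:
                s = frozenset([item])
                counts[s] = counts.get(s, 0) + 1
        Fk = {s: c for s, c in counts.items() if c / n >= minsup}
        frequent = dict(Fk)
        k = 2
        while Fk:
            prev = list(Fk)
            Ck = set()                    # candidate generation: join + prune
            for i in range(len(prev)):
                for j in range(i + 1, len(prev)):
                    cand = prev[i] | prev[j]
                    if len(cand) == k and all(
                            frozenset(s) in Fk
                            for s in combinations(cand, k - 1)):
                        Ck.add(cand)
            counts = {c: sum(1 for t in T if c <= t) for c in Ck}  # one scan
            Fk = {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}
            frequent.update(Fk)
            k += 1
        return frequent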
Example – Finding frequent itemsets (minsup = 0.5)

Dataset T:
TID    Items
T100   1, 3, 4
T200   2, 3, 5
T300   1, 2, 3, 5
T400   2, 5

(itemset:count)
1. scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. scan T → C3: {2,3,5}:2 → F3: {2,3,5}

15
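Running the apriori sketch from the previous slide (an assumption of this note, not slide material) on this dataset reproduces the trace:

    T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    F = apriori(T, minsup=0.5)
    for itemset, count in sorted(F.items(),
                                 key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), count)
    # [1] 2, [2] 3, [3] 3, [5] 3,
    # [1, 3] 2, [2, 3] 2, [2, 5] 3, [3, 5] 2, [2, 3, 5] 2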
Step 2: Generating rules from frequent
itemsets
 Frequent itemsets → association rules
 One more step is needed to generate
association rules
 For each frequent itemset X,
for each proper nonempty subset A of X:
 Let B = X − A
 A → B is an association rule if
 confidence(A → B) ≥ minconf

support(A → B) = support(A ∪ B) = support(X)
confidence(A → B) = support(A ∪ B) / support(A)
(see the sketch below)
16
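A Python sketch of this rule-generation step, assuming `frequent` maps every frequent itemset (as a frozenset) to its support count — downward closure guarantees each subset A of a frequent X is in that map:

    from itertools import combinations

    def gen_rules(X, frequent, minconf):
        # all rules A -> B with B = X - A and confidence >= minconf
        rules = []
        for r in range(1, len(X)):                # proper nonempty subsets A
            for A in map(frozenset, combinations(X, r)):
                conf = frequent[X] / frequent[A]  # support(X) / support(A)
                if conf >= minconf:
                    rules.append((set(A), set(X - A), conf))
        return rules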
Generating rules: an example
 Suppose {2,3,4} is frequent, with sup=50%
 Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with
sup=50%, 50%, 75%, 75%, 75%, 75% respectively
 These generate the following association rules:
 2,3 → 4, confidence=100%
 2,4 → 3, confidence=100%
 3,4 → 2, confidence=67%
 2 → 3,4, confidence=67%
 3 → 2,4, confidence=67%
 4 → 2,3, confidence=67%
 All rules have support = 50%

17
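The slide's numbers can be checked with the gen_rules sketch from the previous slide, hard-coding the stated supports as counts (n = 4 transactions assumed, so that a count of 2 means sup = 50%):

    frequent = {frozenset(s): c for s, c in [
        ({2, 3, 4}, 2), ({2, 3}, 2), ({2, 4}, 2), ({3, 4}, 3),
        ({2}, 3), ({3}, 3), ({4}, 3)]}
    for A, B, conf in gen_rules(frozenset({2, 3, 4}), frequent, minconf=0.0):
        print(sorted(A), "->", sorted(B), f"conf={conf:.0%}")
    # 2,3 -> 4 and 2,4 -> 3 at 100%; the other four rules at 67%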
