Mining Association Rules

Association rule mining

• Proposed by Agrawal et al. in 1993.
• An important data mining model, studied extensively by the database and data mining communities.
• Assumes all data are categorical.
• Initially used for market basket analysis, to find how the items purchased by customers are related, e.g.:

    Bread → Milk  [sup = 5%, conf = 100%]

The model: data

• I = {i1, i2, …, im}: a set of items.
• Transaction t: a set of items such that t ⊆ I.
• Transaction database T: a set of transactions T = {t1, t2, …, tn}.

Transaction data: supermarket data

• Market basket transactions:
    t1: {bread, cheese, milk}
    t2: {apple, eggs, salt, yogurt}
    … …
    tn: {biscuit, eggs, milk}
• Concepts:
  • An item: an item/article in a basket.
  • I: the set of all items sold in the store.
  • A transaction: the items purchased in a basket; it may have a TID (transaction ID).
  • A transactional dataset: a set of transactions.
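(Not part of the original slides: a minimal Python sketch of this data model, with assumed toy item names. Each transaction t is a set of items, and I is recovered as the union of all transactions.)

    # Toy transaction database T: each transaction t is a set of items, t ⊆ I.
    T = [
        {"bread", "cheese", "milk"},          # t1
        {"apple", "eggs", "salt", "yogurt"},  # t2
        {"biscuit", "eggs", "milk"},          # tn
    ]

    # I: the set of all items appearing in the transactions.
    I = set().union(*T)
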
The model: rules
 A transaction t contains X, a set of items
(itemset) in I, if X  t.
 An association rule is an implication of the
form:
X  Y, where X, Y  I, and X Y = 

 An itemset is a set of items.


 E.g., X = {milk, bread, cereal} is an itemset.
 A k-itemset is an itemset with k items.
 E.g., {milk, bread, cereal} is a 3-itemset

Rule strength measures
 Support: The rule holds with support sup in T
(the transaction data set) if sup% of
transactions contain X  Y.
 sup = Pr(X  Y).
 Confidence: The rule holds in T with
confidence conf if conf% of tranactions that
contain X also contain Y.
 conf = Pr(Y | X)
 An association rule is a pattern that states
when X occurs, Y occurs with certain
probability.
Support and Confidence

• Support count: the support count of an itemset X, denoted by X.count, in a data set T is the number of transactions in T that contain X. Assume T has n transactions.
• Then,

    support = (X ∪ Y).count / n

    confidence = (X ∪ Y).count / X.count

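(The two formulas above translate directly into code. A minimal sketch; the helper names support_count, support, and confidence are my own, not from the slides.)

    def support_count(X, T):
        """X.count: the number of transactions in T that contain itemset X."""
        return sum(1 for t in T if X <= t)

    def support(X, Y, T):
        """support(X → Y) = (X ∪ Y).count / n"""
        return support_count(X | Y, T) / len(T)

    def confidence(X, Y, T):
        """confidence(X → Y) = (X ∪ Y).count / X.count"""
        return support_count(X | Y, T) / support_count(X, T)
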
Goal and key features

• Goal: find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf).
• Key features:
  • Completeness: find all rules.
  • No target item(s) on the right-hand side.
  • Mining with data on hard disk (not in memory).

An example

• Transaction data:
    t1: Beef, Chicken, Milk
    t2: Beef, Cheese
    t3: Cheese, Boots
    t4: Beef, Chicken, Cheese
    t5: Beef, Chicken, Clothes, Cheese, Milk
    t6: Chicken, Clothes, Milk
    t7: Chicken, Milk, Clothes
• Assume:
    minsup = 30%
    minconf = 80%
• An example frequent itemset:
    {Chicken, Clothes, Milk}  [sup = 3/7]
• Association rules from the itemset:
    Clothes → Milk, Chicken  [sup = 3/7, conf = 3/3]
    … …
    Clothes, Chicken → Milk  [sup = 3/7, conf = 3/3]

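(Using the helper functions sketched earlier, the numbers above can be checked directly; this assumes the seven transactions are encoded as Python sets.)

    T = [
        {"Beef", "Chicken", "Milk"},
        {"Beef", "Cheese"},
        {"Cheese", "Boots"},
        {"Beef", "Chicken", "Cheese"},
        {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
        {"Chicken", "Clothes", "Milk"},
        {"Chicken", "Milk", "Clothes"},
    ]
    X, Y = {"Clothes"}, {"Milk", "Chicken"}
    print(support(X, Y, T))     # 3/7 ≈ 0.43, above minsup = 30%
    print(confidence(X, Y, T))  # 3/3 = 1.0, above minconf = 80%
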
Transaction data representation

• A simplistic view of shopping baskets.
• Some important information is not considered, e.g.:
  • the quantity of each item purchased, and
  • the price paid.

Many mining algorithms

• There are a large number of them!
• They use different strategies and data structures.
• Their resulting sets of rules are all the same:
  • Given a transaction data set T, a minimum support, and a minimum confidence, the set of association rules existing in T is uniquely determined.
• Any algorithm should find the same set of rules, although their computational efficiencies and memory requirements may differ.
• We study only one: the Apriori algorithm.

The Apriori algorithm

• Probably the best-known algorithm.
• Two steps:
  • Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
  • Use frequent itemsets to generate rules.
• E.g., a frequent itemset
    {Chicken, Clothes, Milk}  [sup = 3/7]
  and one rule from the frequent itemset
    Clothes → Milk, Chicken  [sup = 3/7, conf = 3/3]

Step 1: Mining all frequent itemsets

• A frequent itemset is an itemset whose support is ≥ minsup.
• Key idea: the apriori property (downward closure property): every subset of a frequent itemset is also a frequent itemset.

    (Itemset lattice over items A, B, C, D:)
    ABC  ABD  ACD  BCD
    AB  AC  AD  BC  BD  CD
    A  B  C  D

The Algorithm

• Iterative algorithm (also called level-wise search): find all 1-item frequent itemsets, then all 2-item frequent itemsets, and so on.
  • In each iteration k, only consider itemsets that contain some frequent (k−1)-itemset.
• Find frequent itemsets of size 1: F1.
• For k = 2 onwards:
  • Ck = candidates of size k: those itemsets of size k that could be frequent, given Fk−1.
  • Fk = those itemsets in Ck that are actually frequent, Fk ⊆ Ck (this needs one scan of the database).

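(The level-wise search above can be sketched in a few lines of Python. This is a minimal illustrative implementation, not an optimized one, and the names apriori and frequent are mine: candidate generation joins Fk−1 with itself and prunes by the downward closure property.)

    from itertools import combinations

    def apriori(T, minsup):
        """Level-wise frequent itemset mining. T: list of sets; minsup: a fraction."""
        n = len(T)

        def frequent(candidates):
            # One database scan: count each candidate, keep those with support ≥ minsup.
            counts = {c: sum(1 for t in T if c <= t) for c in candidates}
            return {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}

        # F1: frequent 1-itemsets.
        F = frequent({frozenset([i]) for t in T for i in t})
        all_frequent = dict(F)
        k = 2
        while F:
            prev = set(F)  # F(k-1)
            # Ck: size-k unions of frequent (k-1)-itemsets ...
            Ck = {a | b for a in prev for b in prev if len(a | b) == k}
            # ... pruned by downward closure: every (k-1)-subset must be frequent.
            Ck = {c for c in Ck
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))}
            F = frequent(Ck)  # one more scan of T
            all_frequent.update(F)
            k += 1
        return all_frequent
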
Example – Finding frequent itemsets (minsup = 0.5)

Dataset T:
    TID     Items
    T100    1, 3, 4
    T200    2, 3, 5
    T300    1, 2, 3, 5
    T400    2, 5

(itemset : count)
1. scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
       → F1: {1}:2, {2}:3, {3}:3, {5}:3
       → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
       → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
       → C3: {2,3,5}
3. scan T → C3: {2,3,5}:2 → F3: {2,3,5}

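(Running the apriori sketch above on this dataset reproduces the trace: itemsets are frozensets, values are support counts.)

    T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
    F = apriori(T, minsup=0.5)
    for itemset, count in sorted(F.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), count)
    # [1] 2, [2] 3, [3] 3, [5] 3, [1,3] 2, [2,3] 2, [2,5] 3, [3,5] 2, [2,3,5] 2
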
Step 2: Generating rules from frequent itemsets

• Frequent itemsets ≠ association rules: one more step is needed to generate association rules.
• For each frequent itemset X,
  for each proper nonempty subset A of X:
  • Let B = X - A.
  • A → B is an association rule if
      confidence(A → B) ≥ minconf,
      where support(A → B) = support(A ∪ B) = support(X), and
      confidence(A → B) = support(A ∪ B) / support(A).

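(A sketch of this step in Python; again illustrative, and generate_rules is my name. It takes the support counts produced by the apriori sketch; by downward closure every subset A of a frequent X is itself in the table, so support(A) can be looked up directly.)

    from itertools import combinations

    def generate_rules(all_frequent, n, minconf):
        """For each frequent X and proper nonempty A ⊂ X, emit A → B where B = X - A."""
        rules = []
        for X, x_count in all_frequent.items():
            if len(X) < 2:
                continue
            for r in range(1, len(X)):
                for A in map(frozenset, combinations(X, r)):
                    conf = x_count / all_frequent[A]  # support(X) / support(A)
                    if conf >= minconf:
                        rules.append((set(A), set(X - A), x_count / n, conf))
        return rules
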
Generating rules: an example

• Suppose {2,3,4} is frequent, with sup = 50%.
  • Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively.
• These generate the following association rules:
  • 2,3 → 4, confidence = 100%
  • 2,4 → 3, confidence = 100%
  • 3,4 → 2, confidence = 67%
  • 2 → 3,4, confidence = 67%
  • 3 → 2,4, confidence = 67%
  • 4 → 2,3, confidence = 67%
  • All rules have support = 50%.

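(To check this arithmetic with the generate_rules sketch, the slide's percentages can be turned into counts by assuming a hypothetical database of n = 4 transactions, so that 50% = 2 and 75% = 3.)

    counts = {frozenset(s): c for s, c in [
        ((2, 3, 4), 2), ((2, 3), 2), ((2, 4), 2), ((3, 4), 3),
        ((2,), 3), ((3,), 3), ((4,), 3),
    ]}
    for A, B, sup, conf in generate_rules(counts, n=4, minconf=0.0):
        if A | B == {2, 3, 4}:  # keep only rules generated from X = {2,3,4}
            print(f"{sorted(A)} → {sorted(B)}  [sup={sup:.0%}, conf={conf:.0%}]")
    # [2, 3] → [4] 100%, [2, 4] → [3] 100%, [3, 4] → [2] 67%, [2] → [3, 4] 67%, ...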