
Data Mining

Lecture 4:
Frequent Patterns Analysis

Frequent Itemsets

• Given a set of transactions, find combinations of items (itemsets) that occur frequently.
• Transaction: a set of items.
• Frequent pattern: a pattern (a set of items) that occurs frequently in a data set.

Items: {Bread, Milk, Diaper, Coffee, Eggs, Coke}

Market-basket transactions:

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Coffee, Eggs
3   | Milk, Diaper, Coffee, Coke
4   | Bread, Milk, Diaper, Coffee
5   | Bread, Milk, Diaper, Coke

Examples of frequent itemsets:
{Bread}: 4, {Milk}: 4, {Coffee}: 3, {Diaper, Coffee}: 3, {Milk, Bread}: 3

Examples of association rules:
{Diaper} → {Coffee}, {Milk, Bread} → {Eggs, Coke}, {Coffee, Bread} → {Milk}
Applications

• Sets of products someone bought in one trip to the store.
  – Given that many people buy coffee and sugar together: run a sale on coffee and raise the price of sugar.
  – Only useful if many customers buy coffee and sugar together.
• Words in different web pages.
  – Unusual words appearing together in a large number of documents, e.g., "Brad" and "Angelina," may indicate an interesting relationship.
• Sentences in documents.
  – Documents that share many sentences too often could represent plagiarism.
Definition: Frequent Itemset

• Itemset: a collection of one or more items, e.g., {Milk, Diaper, Coffee}.
• Support count (σ): the number of transactions that contain an itemset.
• Support (s): the fraction of transactions that contain an itemset.
• Frequent itemset: an itemset whose support is greater than or equal to a minsup threshold.

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Coffee, Eggs
3   | Milk, Diaper, Coffee, Coke
4   | Bread, Milk, Diaper, Coffee
5   | Bread, Milk, Diaper, Coke
Definition: Association Rule

• Association rule:
  – An implication expression of the form X → Y, where X and Y are itemsets.
  – Example: {Milk, Diaper} → {Coffee}
• Rule evaluation metrics:
  – Support (s): the fraction of transactions that contain both X and Y.
  – Confidence (c): measures how often items in Y appear in transactions that contain X.

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Coffee, Eggs
3   | Milk, Diaper, Coffee, Coke
4   | Bread, Milk, Diaper, Coffee
5   | Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Coffee}

s = σ(Milk, Diaper, Coffee) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Coffee) / σ(Milk, Diaper) = 2/3 ≈ 0.67
Association Rule

• Input: a set of transactions T over a set of items I.
• Output: all rules X → Y with items in I having:
  – Support ≥ minsup threshold
  – Confidence ≥ minconf threshold
• Find all rules X → Y with minimum support and confidence:
  – Support (s) is the probability that a transaction contains X ∪ Y:
    s = P(X ∪ Y) = support count(X ∪ Y) / number of all transactions
  – Confidence (c) is the conditional probability that a transaction containing X also contains Y:
    c = P(Y | X) = support count(X ∪ Y) / support count(X)
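As a concrete illustration, here is a minimal Python sketch (our own, not from the slides) that computes these two metrics for a rule X → Y over the market-basket transactions shown earlier; the function names are assumptions.

```python
# A minimal sketch of the two rule metrics over a list of transactions.
def support_count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(X, Y, transactions):
    """Support and confidence of the rule X -> Y."""
    both = support_count(X | Y, transactions)
    s = both / len(transactions)                # s = P(X u Y)
    c = both / support_count(X, transactions)   # c = P(Y | X)
    return s, c

T = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Coffee", "Eggs"},
     {"Milk", "Diaper", "Coffee", "Coke"},
     {"Bread", "Milk", "Diaper", "Coffee"},
     {"Bread", "Milk", "Diaper", "Coke"}]

# {Milk, Diaper} -> {Coffee}: s = 2/5 = 0.4, c = 2/3 ~ 0.67
print(rule_metrics({"Milk", "Diaper"}, {"Coffee"}, T))
```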

Example

Tid | Items bought
10  | Juice, Nuts, Diaper
20  | Juice, Coffee, Diaper
30  | Juice, Diaper, Eggs
40  | Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

• Let minsup = 50%, minconf = 50%:
  – Number of all transactions = 5, so min. support count = 5 × 50% = 2.5 ⇒ 3.
  – Items: Juice, Nuts, Diaper, Coffee, Eggs, Milk.
  – Frequent itemsets: {Juice}:3, {Nuts}:3, {Diaper}:4, {Eggs}:3, {Juice, Diaper}:3.
  – Association rules (support, confidence):
    Juice → Diaper (3/5, 3/3) → (60%, 100%).
    Diaper → Juice (3/5, 3/4) → (60%, 75%).
Mining Association Rules

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Coffee, Eggs
3   | Milk, Diaper, Coffee, Coke
4   | Bread, Milk, Diaper, Coffee
5   | Bread, Milk, Diaper, Coke

Example rules:
{Milk, Diaper} → {Coffee}    (s=0.4, c=0.67)
{Milk, Coffee} → {Diaper}    (s=0.4, c=1.0)
{Diaper, Coffee} → {Milk}    (s=0.4, c=0.67)
{Coffee} → {Milk, Diaper}    (s=0.4, c=0.67)
{Diaper} → {Milk, Coffee}    (s=0.4, c=0.5)
{Milk} → {Diaper, Coffee}    (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Coffee}.
• Rules originating from the same itemset have identical support but can have different confidence.
• Thus, we may decouple the support and confidence requirements.
Mining Association Rules

• Two-step approach:
  1. Frequent itemset generation: generate all itemsets whose support ≥ minsup.
  2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset such that its confidence ≥ minconf.
• Frequent itemset generation is still computationally expensive.

Example: the frequent itemset {A,B,C,D} yields rules such as AB → CD.
Frequent Itemset Generation

The itemset lattice represents all possible itemsets and their relationships:

null
A  B  C  D  E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE

Given d items, there are 2^d possible itemsets. Too expensive to test all!
Frequent Itemset Generation

• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset.
  – Count the support of each candidate by scanning the database: each of the N transactions is matched against each of the M candidates (with w the maximum transaction width, this takes on the order of N × M × w comparisons).
  – Expensive, since M = 2^d !!!

Transactions:
TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Coffee, Eggs
3   | Milk, Diaper, Coffee, Coke
4   | Bread, Milk, Diaper, Coffee
5   | Bread, Milk, Diaper, Coke
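To make the cost concrete, here is a brute-force sketch (our own illustration, feasible only for small d): it materializes every non-empty itemset and counts its support with a full database scan, mirroring the N × M matching above.

```python
# A brute-force sketch: enumerate all 2^d - 1 non-empty itemsets and
# count each one's support by scanning the whole database.
from itertools import combinations

def brute_force_frequent(transactions, minsup_count):
    items = sorted(set().union(*transactions))
    frequent = {}
    for k in range(1, len(items) + 1):           # every candidate size
        for cand in combinations(items, k):      # every k-itemset
            count = sum(1 for t in transactions if set(cand) <= t)
            if count >= minsup_count:
                frequent[cand] = count
    return frequent

T = [{"Bread", "Milk"},
     {"Bread", "Diaper", "Coffee", "Eggs"},
     {"Milk", "Diaper", "Coffee", "Coke"},
     {"Bread", "Milk", "Diaper", "Coffee"},
     {"Bread", "Milk", "Diaper", "Coke"}]
print(brute_force_frequent(T, 3))   # matches the frequent itemsets of slide 1
```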
The Apriori Principle

• Apriori principle (main observation):
  – If an itemset is frequent, then all of its subsets must also be frequent.
  – If an itemset is not frequent, then none of its supersets can be frequent.
  – If {Coffee, Diaper, Nuts} is frequent, so is {Coffee, Diaper}: every transaction containing {Coffee, Diaper, Nuts} also contains {Coffee, Diaper}.
  – The support of an itemset never exceeds the support of its subsets: for all X ⊆ Y, s(X) ≥ s(Y).
  – This is known as the anti-monotone property of support.
Illustration of the Apriori principle

[Figure: itemset lattice. When an itemset is found to be frequent, all of its subsets are frequent as well.]

[Figure: itemset lattice over {A, B, C, D, E}. When AB is found to be infrequent, every superset of AB (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) is also infrequent, and that whole branch of the lattice is pruned.]
Illustration of the Apriori principle

minsup = 3

TID | Items
1   | Bread, Milk
2   | Bread, Diaper, Coffee, Eggs
3   | Milk, Diaper, Coffee, Coke
4   | Bread, Milk, Diaper, Coffee
5   | Bread, Milk, Diaper, Coke

Items (1-itemsets):
Item   | Count
Bread  | 4
Coke   | 2
Milk   | 4
Coffee | 3
Diaper | 4
Eggs   | 1

Pairs (2-itemsets): no need to generate candidates involving Coke or Eggs.
Itemset          | Count
{Bread, Milk}    | 3
{Bread, Coffee}  | 2
{Bread, Diaper}  | 3
{Milk, Coffee}   | 2
{Milk, Diaper}   | 3
{Coffee, Diaper} | 3

Triplets (3-itemsets): no need to generate candidates involving {Bread, Coffee} or {Milk, Coffee}.
Itemset               | Count
{Bread, Milk, Diaper} | 2

This triplet is below the minsup threshold.

If every subset is considered: 2^6 = 64 candidates.
With support-based pruning: 6 + 6 + 1 = 13.
Apriori Algorithm

• Method (a code sketch follows this list):
  – Let k = 1.
  – Generate frequent itemsets of length 1.
  – Repeat until no new frequent itemsets are identified:
    1. Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
    2. Prune candidate itemsets containing any subset of length k that is infrequent.
    3. Count the support of each candidate by scanning the DB.
    4. Eliminate candidates that are infrequent, leaving only those that are frequent.
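Below is a compact sketch of this loop, assuming transactions are given as Python sets; the function and variable names are our own, not from the slides. The candidate-generation step is the join-and-prune procedure detailed on the next slide.

```python
# A minimal Apriori sketch. Itemsets are sorted tuples; transactions are sets.
from itertools import combinations

def apriori(transactions, minsup_count):
    def count(cands):
        # one DB scan: support count of each candidate
        return {c: sum(1 for t in transactions if set(c) <= t) for c in cands}

    items = sorted(set().union(*transactions))
    Lk = {c: n for c, n in count([(i,) for i in items]).items()
          if n >= minsup_count}                       # frequent 1-itemsets
    frequent = dict(Lk)
    while Lk:
        Ck = set()
        for a in Lk:                                  # self-join Lk with Lk
            for b in Lk:
                if a[:-1] == b[:-1] and a[-1] < b[-1]:
                    cand = a + (b[-1],)
                    # prune: every k-subset must itself be frequent
                    if all(s in Lk for s in combinations(cand, len(a))):
                        Ck.add(cand)
        Lk = {c: n for c, n in count(Ck).items() if n >= minsup_count}
        frequent.update(Lk)                           # eliminate infrequent
    return frequent

# The TDB of Example (1) below, with minsup count 2:
TDB = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(apriori(TDB, 2))   # contains ('B', 'C', 'E'): 2
```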

Important Details of Apriori

• Notation: Ck = candidate itemsets of size k; Lk = frequent itemsets of size k.
• How to generate candidates?
  – Step 1: self-joining Lk.
    Join any two itemsets from Lk if they share the same (k-1)-prefix (i.e., they differ by the last item only).
  – Step 2: pruning (omitted in most implementations).
    Prune any itemset from Ck+1 if any of its k-itemset subsets is not in Lk.
• Example of candidate generation (a code sketch follows):
  – L3 = {abc, abd, acd, ace, bcd}
  – Self-joining L3 * L3:
    abcd from abc and abd
    acde from acd and ace
  – Pruning:
    acde is removed because ade is not in L3
  – C4 = {abcd}
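A standalone sketch of the join-and-prune step, run on the L3 example above; `apriori_gen` is our own name for the helper.

```python
# Join Lk with itself on a shared (k-1)-prefix, then prune every
# candidate that has an infrequent k-subset.
from itertools import combinations

def apriori_gen(Lk):
    Lk = set(Lk)                  # frequent k-itemsets as sorted tuples
    Ck1 = []
    for a in sorted(Lk):
        for b in sorted(Lk):
            if a[:-1] == b[:-1] and a[-1] < b[-1]:     # join step
                cand = a + (b[-1],)
                if all(s in Lk for s in combinations(cand, len(a))):
                    Ck1.append(cand)                   # survived pruning
    return Ck1

L3 = [tuple(s) for s in ["abc", "abd", "acd", "ace", "bcd"]]
print(apriori_gen(L3))  # [('a','b','c','d')]; acde is pruned (ade not in L3)
```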
Example: Generate Candidates Ck+1

• L3 = {abc, abd, acd, ace, bcd}
• Self-joining L3 * L3:
  – {a,b,c} and {a,b,d} join to give {a,b,c,d}
  – {a,c,d} and {a,c,e} join to give {a,c,d,e}
• Pruning:
  – abcd is kept, since all of its 3-subsets (abc, abd, acd, bcd) are in L3.
  – acde is removed, because ade is not in L3.
• C4 = {abcd}
The Apriori Algorithm: Example (1)

Supmin = 2

Database TDB:
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

1st scan: C1 = {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
          L1 = {A}:2, {B}:3, {C}:3, {E}:3        ({D} is infrequent)

C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
          L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (from L2): {B,C,E}
3rd scan: L3 = {B,C,E}:2
The Apriori Algorithm: Example (2)

minsupp = 2

TID | Items
1   | {A,B}
2   | {B,C,D}
3   | {A,B,C,D,E}
4   | {A,D,E}
5   | {A,B,C}
6   | {A,B,C,D}
7   | {B,C}
8   | {A,B,C}
9   | {A,B,D}
10  | {B,C,E}

Itemsets with their support counts (candidates with support below minsupp are discarded):
1-itemsets: A (7), B (9), C (7), D (5), E (3)
2-itemsets: AB (6), AC (4), AD (4), AE (2), BC (7), BD (4), BE (2), CD (3), CE (2), DE (2)
3-itemsets: ABC (4), ABD (3), ABE (1), ACD (2), ACE (1), ADE (2), BCD (3), BCE (2), BDE (1), CDE (1)
4-itemsets: ABCD (2), BCDE (1)

Save the frequent itemsets along with their supports for later!!!
Rule Generation

• We have all frequent itemsets; how do we get the rules?
• For every frequent itemset S, we find rules of the form L → S - L, where L ⊂ S, that satisfy the minimum confidence requirement (a code sketch follows).
• Example: S = {A,B,C,D}
  – Candidate rules:
    A → BCD, B → ACD, C → ABD, D → ABC,
    AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB,
    ABC → D, ABD → C, ACD → B, BCD → A.
• If |S| = k, then there are 2^k - 2 candidate association rules (ignoring S → ∅ and ∅ → S).
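A minimal sketch of this enumeration (our own). It assumes a dictionary `supp` of support counts saved during the Apriori passes; the demo counts below are hypothetical, chosen only for illustration.

```python
# Generate all rules L -> S - L from one frequent itemset S.
from itertools import combinations

def rules_from_itemset(S, supp, minconf):
    S = frozenset(S)
    out = []
    for r in range(1, len(S)):                    # proper non-empty subsets L
        for L in map(frozenset, combinations(S, r)):
            conf = supp[S] / supp[L]              # conf(L -> S - L)
            if conf >= minconf:
                out.append((sorted(L), sorted(S - L), conf))
    return out

# hypothetical support counts for S = {A,B,C,D} and its subsets
supp = {frozenset(s): c for s, c in [
    ("A", 5), ("B", 6), ("C", 6), ("D", 5),
    ("AB", 4), ("AC", 4), ("AD", 4), ("BC", 5), ("BD", 4), ("CD", 4),
    ("ABC", 3), ("ABD", 3), ("ACD", 3), ("BCD", 3), ("ABCD", 3)]}
for L, R, c in rules_from_itemset("ABCD", supp, 0.8):
    print(L, "->", R, f"{c:.0%}")
```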

Different-Colored Cellular Phone Faceplates

Transaction | Faceplate Colors Purchased
1           | Red, White, Green
2           | White, Orange
3           | White, Blue
4           | Red, White, Orange
5           | Red, Blue
6           | White, Blue
7           | White, Orange
8           | Red, White, Blue, Green
9           | Red, White, Blue
10          | Yellow
Phone Faceplate Data in Binary Matrix Format

T  | Red | White | Blue | Orange | Green | Yellow
1  |  1  |   1   |  0   |   0    |   1   |   0
2  |  0  |   1   |  0   |   1    |   0   |   0
3  |  0  |   1   |  1   |   0    |   0   |   0
4  |  1  |   1   |  0   |   1    |   0   |   0
5  |  1  |   0   |  1   |   0    |   0   |   0
6  |  0  |   1   |  1   |   0    |   0   |   0
7  |  0  |   1   |  0   |   1    |   0   |   0
8  |  1  |   1   |  1   |   0    |   1   |   0
9  |  1  |   1   |  1   |   0    |   0   |   0
10 |  0  |   0   |  0   |   0    |   0   |   1
Itemsets with Support Count of At Least Two (20%)

Itemset             | Support
{red}               | 5
{white}             | 8
{blue}              | 5
{orange}            | 3
{green}             | 2
{red, white}        | 4
{red, blue}         | 3
{red, green}        | 2
{white, blue}       | 4
{white, orange}     | 3
{white, green}      | 2
{red, white, blue}  | 2
{red, white, green} | 2
Generating Association Rules

• For the itemset {red, white, green}:
  – Rule 1: {red, white} => {green}
    conf = sup{red, white, green} / sup{red, white} = 2/4 = 50%
  – Rule 2: {red, green} => {white}
    conf = sup{red, white, green} / sup{red, green} = 2/2 = 100%
  – Rule 3: {white, green} => {red}
    conf = sup{red, white, green} / sup{white, green} = 2/2 = 100%
  – Rule 4: {red} => {white, green}
    conf = sup{red, white, green} / sup{red} = 2/5 = 40%
  – Rule 5: {white} => {red, green}
    conf = sup{red, white, green} / sup{white} = 2/8 = 25%
  – Rule 6: {green} => {red, white}
    conf = sup{red, white, green} / sup{green} = 2/2 = 100%
• With a desired min_conf of 70%, we keep Rules 2, 3, and 6.
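As a quick check, a small sketch (our own) that recomputes the six confidences above directly from the support counts in the itemset table two slides back:

```python
# Recompute conf(X -> S - X) for every proper subset X of {red, white, green}.
supp = {frozenset(s): c for s, c in [
    ({"red"}, 5), ({"white"}, 8), ({"green"}, 2),
    ({"red", "white"}, 4), ({"red", "green"}, 2), ({"white", "green"}, 2),
    ({"red", "white", "green"}, 2)]}

S = frozenset({"red", "white", "green"})
for X in ({"red", "white"}, {"red", "green"}, {"white", "green"},
          {"red"}, {"white"}, {"green"}):
    X = frozenset(X)
    conf = supp[S] / supp[X]                  # conf(X -> S - X)
    print(sorted(X), "=>", sorted(S - X), f"{conf:.0%}")
```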

Final Results for Phone Faceplate Transactions

Rule # | Conf. % | X            | Y          | Supp(X) | Supp(Y) | Supp(X∪Y)
1      | 100     | Green        | Red, White | 2 (20%) | 4 (40%) | 2 (20%)
2      | 100     | Green        | Red        | 2 (20%) | 5 (50%) | 2 (20%)
3      | 100     | Green, White | Red        | 2 (20%) | 5 (50%) | 2 (20%)
4      | 100     | Green        | White      | 2 (20%) | 8 (80%) | 2 (20%)
5      | 100     | Green, Red   | White      | 2 (20%) | 8 (80%) | 2 (20%)
6      | 100     | Orange       | White      | 3 (30%) | 8 (80%) | 3 (30%)
Example (3):

• Use Apriori to generate frequent itemsets for the following transaction database:
• Let min sup = 60% and min conf = 80%.

TID Items-bought
T100 {F, A, C, D, G, I, M, P}
T200 {A, B, C, F, L, M, O}
T300 {B, F, H, J, O, W}
T400 {B, C, K, S, P}
T500 {A, F, C, E, L, P, M, N}

C1: A 3, B 3, C 4, D 1, E 1, F 4, G 1, H 1, I 1, J 1, K 1, L 2, M 3, N 1, O 2, P 3, S 1, W 1

L1: A 3, B 3, C 4, F 4, M 3, P 3

C2: AB 1, AC 3, AF 3, AM 3, AP 2, BC 2, BF 2, BM 1, BP 1, CF 3, CM 3, CP 3, FM 3, FP 2, MP 2

L2: AC 3, AF 3, AM 3, CF 3, CM 3, CP 3, FM 3

C3: ACF 3, ACM 3, AFM 3, CFM 3, CFP 2, CMP 2

L3: ACF 3, ACM 3, AFM 3, CFM 3

C4: ACFM 3

L4: ACFM 3

C5 = ∅
• PHASE 2 OF APRIORI:
• For every frequent itemset L, we find all of its proper subsets and create the association rules, as shown in the next example.

• Let L be {A, C, F, M}.

• The proper subsets of L:
  S1=A, S2=C, S3=F, S4=M,
  S5=AC, S6=AF, S7=AM, S8=CF, S9=CM, S10=FM,
  S11=ACF, S12=ACM, S13=AFM, S14=CFM

Each rule Rx: Sx → L - Sx
CONF(Rx) = SUPPORT(L) / SUPPORT(Sx)
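A small sketch (our own) that evaluates all 14 candidate rules for L = {A, C, F, M} from the phase-1 support counts above; it reproduces the confidences worked out on the next slides.

```python
# Evaluate every rule Sx -> L - Sx for L = {A,C,F,M}.
# Note from phase 1 that supp(C) = supp(F) = 4, while all other
# subsets of L have support 3.
from itertools import combinations

supp = {frozenset(s): c for s, c in [
    ("A", 3), ("C", 4), ("F", 4), ("M", 3),
    ("AC", 3), ("AF", 3), ("AM", 3), ("CF", 3), ("CM", 3), ("FM", 3),
    ("ACF", 3), ("ACM", 3), ("AFM", 3), ("CFM", 3), ("ACFM", 3)]}

L, minconf = frozenset("ACFM"), 0.80
for r in range(1, len(L)):
    for Sx in map(frozenset, combinations(sorted(L), r)):
        conf = supp[L] / supp[Sx]
        print("".join(sorted(Sx)), "->", "".join(sorted(L - Sx)),
              f"{conf:.0%}", "STRONG" if conf >= minconf else "NOT STRONG")
```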

R1: S1 → L - S1
A → CFM
CONF(R1) = 3/3 = 100% > 80%  STRONG

R2: S2 → L - S2
C → AFM
CONF(R2) = 3/4 = 75% < 80%  NOT STRONG

R3: S3 → L - S3
F → ACM
CONF(R3) = 3/4 = 75% < 80%  NOT STRONG

R4: S4 → L - S4
M → ACF
CONF(R4) = 3/3 = 100% > 80%  STRONG

R5: S5 → L - S5
AC → FM
CONF(R5) = 3/3 = 100% > 80%  STRONG

R6: S6 → L - S6
AF → CM
CONF(R6) = 3/3 = 100% > 80%  STRONG

R7: S7 → L - S7
AM → CF
CONF(R7) = 3/3 = 100% > 80%  STRONG

R8: S8 → L - S8
CF → AM
CONF(R8) = 3/3 = 100% > 80%  STRONG

R9: S9 → L - S9
CM → AF
CONF(R9) = 3/3 = 100% > 80%  STRONG

R10: S10 → L - S10
FM → AC
CONF(R10) = 3/3 = 100% > 80%  STRONG

R11: S11 → L - S11
ACF → M
CONF(R11) = 3/3 = 100% > 80%  STRONG

R12: S12 → L - S12
ACM → F
CONF(R12) = 3/3 = 100% > 80%  STRONG

R13: S13 → L - S13
AFM → C
CONF(R13) = 3/3 = 100% > 80%  STRONG

R14: S14 → L - S14
CFM → A
CONF(R14) = 3/3 = 100% > 80%  STRONG

Note: since SUPPORT(C) = SUPPORT(F) = 4 while every other subset of L has support 3, the only rules that fail the 80% confidence threshold are R2 and R3.
Example (4):

• Use Apriori to generate frequent itemsets for the following transaction database:
• Let min sup = 20% and min conf = 70%.

[The transaction database and the frequent-itemset generation for this example appeared only as figures in the original slides.]
Generating association rules from frequent itemsets

[The rule-generation walkthrough for Example (4) appeared only as figures in the original slides.]
Example (5):

Use Apriori to generate frequent itemsets for the following transaction database:
Let min sup = 60% and min conf = 80%.

TID  | Items-Bought
T100 | E, K, M, N, O, Y
T200 | D, E, K, N, O, Y
T300 | A, E, K, M
T400 | C, K, M, U, Y
T500 | C, E, I, K, O
