
Data Mining:

Concepts and Techniques


(3rd ed.)

— Chapter 6 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign &
Simon Fraser University
©2013 Han, Kamber & Pei. All rights reserved.
1
Chapter 6: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods

 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation Methods

 Summary

2
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context
of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
3
4
Why Is Freq. Pattern Mining Important?
 Freq. pattern: An intrinsic and important property of
datasets
 Foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
 Classification: discriminative, frequent pattern analysis
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Semantic data compression: fascicles
 Broad applications
5
Basic Concepts: Frequent Patterns

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

 itemset: A set of one or more items
 k-itemset X = {x1, …, xk}
 (absolute) support, or support count, of X: frequency or number of occurrences of itemset X
 (relative) support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
 An itemset X is frequent if X's support is no less than a minsup threshold
6
Basic Concepts: Association Rules
Tid   Items bought
10    Butter, Nuts, Diaper
20    Butter, Coffee, Diaper
30    Butter, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

 Find all the rules X ⇒ Y with minimum support and confidence
 support, s: probability that a transaction contains X ∪ Y
 confidence, c: conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
Freq. Pat.: Butter:3, Nuts:3, Diaper:4, Eggs:3, {Butter, Diaper}:3
 Association rules: (many more!)
 Butter ⇒ Diaper (60%, 100%)
 Diaper ⇒ Butter (60%, 75%)
7
Interesting association rules

 confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A)
 (here A ∪ B denotes the union of the itemsets A and B, i.e., transactions containing both A and B)


8
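To make the support and confidence definitions concrete, here is a minimal Python sketch (not part of the original slides; the function names are illustrative) evaluated on the butter/diaper table from slide 7:

def support(itemset, transactions):
    # (relative) support: fraction of transactions containing every item
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # conf(X => Y) = P(Y | X) = support(X union Y) / support(X)
    return support(set(X) | set(Y), transactions) / support(X, transactions)

transactions = [
    {"Butter", "Nuts", "Diaper"},
    {"Butter", "Coffee", "Diaper"},
    {"Butter", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]
print(support({"Butter", "Diaper"}, transactions))       # 0.6
print(confidence({"Diaper"}, {"Butter"}, transactions))  # 0.75

This reproduces the rule Diaper ⇒ Butter (60%, 75%) from slide 7.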


Association rule mining
 Given a transaction database and minsup and
minconf thresholds, compute all association rules
that satisfy minsup and minconf requirements
 Steps
 Find all frequent itemsets
 Generate association rules from the frequent itemsets, keeping those that satisfy minimum confidence

9


Chapter 6: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods

 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation Methods

 Summary

10
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
11
The Downward Closure Property and Scalable
Mining Methods
 The downward closure property of frequent patterns
 Any subset of a frequent itemset must be frequent
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}
 i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}

 Scalable mining methods: Three major approaches
 Apriori
 Freq. pattern growth
 Vertical data format approach
12
Apriori: A Candidate Generation & Test Approach

 Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested!
 Apriori Method:
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from length k
frequent itemsets
 Test the candidates against DB
 Terminate when no frequent or candidate set can be
generated

13
The Apriori Algorithm—An Example
(generating all frequent itemsets, Supmin = 2)

Database TDB
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
           L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
           L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3: {B,C,E}
3rd scan → L3: {B,C,E}:2

C4 = { }. Algorithm terminates.
14
What are the association rules for the previous frequent itemset?
 Steps
 Find all non-empty proper subsets
 Generate rules and find the confidence for each rule
 Select all rules that satisfy min. confidence
 These rules are the strong association rules

 Example: For the itemset {B,C,E}
 Subsets: {B,C}, {B,E}, {C,E}, {B}, {C}, {E}
 There will be six rules, e.g., {B,C} => {E}
15


Cond…

Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

 {B,C} => {E}
 Conf = 2/2 = 100%
 Similarly, find the confidence for the other rules
 Select those rules which satisfy minconf
 Suppose minconf = 60%. What are the strong rules that you can select? (see the sketch after this slide)
16
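A small Python sketch (illustrative, not from the slides) that enumerates all six rules from {B,C,E} over the example database and marks the strong ones at minconf = 60%:

from itertools import combinations

# transaction database from the Apriori example (Tids 10-40)
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]

def sup_count(itemset):
    # absolute support: number of transactions containing the itemset
    return sum(itemset <= t for t in db)

itemset, minconf = {"B", "C", "E"}, 0.6
for r in range(1, len(itemset)):
    for lhs in map(set, combinations(sorted(itemset), r)):
        conf = sup_count(itemset) / sup_count(lhs)
        tag = "strong" if conf >= minconf else "weak"
        print(sorted(lhs), "=>", sorted(itemset - lhs), f"conf={conf:.0%} ({tag})")

Running it shows that every one of the six rules has confidence of at least 66.7% ({B,C} => {E} and {C,E} => {B} reach 100%), so all six are strong at minconf = 60%.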


Implementation of Apriori
 How to generate candidates?
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 (the first k−2 items must be common)
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 If any subset is infrequent, the itemset will also be infrequent
 acde is removed because ade is not in L3
 C4 = {abcd}

17
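The self-join and prune steps can be sketched in Python as follows (an illustrative sketch of the candidate-generation procedure, not the book's code):

def apriori_gen(Lk):
    # Lk: collection of frequent k-itemsets; returns candidate (k+1)-itemsets
    freq = set(map(frozenset, Lk))
    items = sorted(tuple(sorted(s)) for s in Lk)
    k = len(items[0])
    candidates = set()
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            # self-join: join two k-itemsets whose first k-1 items are common
            if items[i][:k - 1] == items[j][:k - 1]:
                c = frozenset(items[i]) | frozenset(items[j])
                # prune: drop c if any of its k-subsets is infrequent
                if all(c - {x} in freq for x in c):
                    candidates.add(c)
    return candidates

L3 = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"a", "c", "e"}, {"b", "c", "d"}]
print(apriori_gen(L3))  # {frozenset({'a','b','c','d'})}; acde is pruned because ade is not in L3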
Example 6.3

MinSupport = 2
[Transaction data and candidate-generation figures for Example 6.3; not preserved in this transcription]

18


C4 = {} and algorithm terminates. L3 contains all frequent
itemsets
19
Calculation of candidate 3-itemsets
[Figure not preserved in this transcription]

20


Rules for Table 6.1
[Figure not preserved in this transcription]

21


The Apriori Algorithm (Pseudo-Code)
Ck: Candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
22
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach

 Improving the Efficiency of Apriori

 FPGrowth: A Frequent Pattern-Growth Approach

 ECLAT: Frequent Pattern Mining with Vertical Data Format

 Mining Closed Frequent Patterns and Maxpatterns

23
Exercise
 Find all frequent itemsets using the Apriori algorithm and generate all association rules (assume minsup = 20%, minconf = 50%)
[Transaction table for this exercise not preserved in this transcription]
24


Further Improvement of the Apriori Method

 Major computational challenges


 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database scans
 Shrink number of candidates
 Facilitate support counting of candidates

25
Partition: Scan Database Only Twice
 Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
 Scan 1: partition database and find local frequent

patterns
 Scan 2: consolidate global frequent patterns

DB1 + DB2 + … + DBk = DB

If supj(i) < σ|DBj| for every partition j = 1, …, k, then sup(i) < σ|DB|
(equivalently: an itemset infrequent in every partition cannot be frequent in the whole DB)
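A sketch of the two-scan idea in Python (illustrative; it assumes an apriori() helper like the sketch after slide 22, returning {itemset: support count}):

import math

def partition_mine(transactions, minsup_frac, k=4):
    n = len(transactions)
    size = math.ceil(n / k)
    # Scan 1: mine each partition with a proportionally scaled local minsup
    candidates = set()
    for start in range(0, n, size):
        part = transactions[start:start + size]
        local_min = max(1, math.ceil(minsup_frac * len(part)))
        candidates |= set(apriori(part, local_min))   # local frequent itemsets
    # Scan 2: count every local winner once against the full database
    result = {}
    for c in candidates:
        cnt = sum(c <= frozenset(t) for t in transactions)
        if cnt >= minsup_frac * n:
            result[c] = cnt
    return result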
Sampling for Frequent Patterns

 Select a sample of the original database, mine frequent patterns within the sample using Apriori
 Scan database again to find missed frequent patterns

27
Bottleneck of Frequent-pattern Mining

 Multiple database scans are costly


 Mining long patterns needs many passes of scanning and generates lots of candidates
 To find frequent itemset i1i2…i100
 # of scans: 100
 # of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 !
 Bottleneck: candidate-generation-and-test
 Can we avoid candidate generation?

28
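That candidate count can be verified directly (a quick arithmetic check, not from the slides):

n = 2**100 - 1           # sum of C(100, r) for r = 1..100
print(n)                 # 1267650600228229401496703205375
print(f"{n:.2e}")        # 1.27e+30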


Pattern-Growth Approach: Mining Frequent Patterns
Without Candidate Generation
 Bottlenecks of the Apriori approach
 Huge Candidate generation and test
 The FPGrowth Approach
 Avoid explicit candidate generation
 Major philosophy: Grow long patterns from short ones using local
frequent items only

29
Construct FP-tree from a Transaction Database

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency descending order to get the f-list
3. Scan DB again, construct the FP-tree

F-list = f-c-a-b-m-p
30
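The two-scan construction can be sketched in Python as follows (illustrative names, not the book's code); items in each transaction are inserted in f-list order:

from collections import defaultdict

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}

def build_fp_tree(transactions, min_sup):
    # Scan 1: count items, keep the frequent ones
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[item] += 1
    freq = {i: c for i, c in counts.items() if c >= min_sup}
    root = FPNode(None, None)
    header = defaultdict(list)   # header table: item -> nodes carrying it
    # Scan 2: insert each transaction, items in descending frequency (f-list) order
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in freq), key=lambda i: (-freq[i], i)):
            if item in node.children:
                node.children[item].count += 1
            else:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
    return root, header, freq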
Example 6.3
[Slides 31–36: step-by-step FP-tree construction and mining for this example; figures not preserved in this transcription]


Find Patterns Having P From P-conditional Database

 Starting at the frequent item header table in the FP-tree


 Traverse the FP-tree by following the link of each frequent item p
 Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

37
From Conditional Pattern-bases to Conditional FP-trees

 For each pattern-base


 Accumulate the count for each item in the base
 Construct the FP-tree for the frequent items of the pattern base

38
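Putting the last two slides together, a compact recursive FP-growth sketch (illustrative; it reuses build_fp_tree from the sketch after slide 30):

def fp_growth(header, freq, min_sup, suffix=()):
    # mine items bottom-up (least frequent first)
    patterns = {}
    for item in sorted(freq, key=lambda i: freq[i]):
        pattern = (item,) + suffix
        patterns[pattern] = sum(n.count for n in header[item])
        # conditional pattern base: prefix path of every node carrying item
        cond_db = []
        for node in header[item]:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            cond_db.extend([path] * node.count)   # path weighted by node count
        # build the conditional FP-tree and recurse on it
        _, cheader, cfreq = build_fp_tree(cond_db, min_sup)
        if cfreq:
            patterns.update(fp_growth(cheader, cfreq, min_sup, pattern))
    return patterns

Calling build_fp_tree on the database and then fp_growth(header, freq, min_sup) returns every frequent pattern with its support, with no candidate generation.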
Benefits of the FP-tree Structure

 Completeness
 Preserve complete information for frequent pattern
mining
 Compactness
 Reduce irrelevant info—infrequent items are gone
 No candidate generation, no candidate test
 Compressed database: FP-tree structure
 No repeated scan of entire database

39
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach

 Improving the Efficiency of Apriori

 FPGrowth: A Frequent Pattern-Growth Approach

 ECLAT: Frequent Pattern Mining with Vertical Data Format

 Mining Closed Frequent Patterns and Maxpatterns

40
CHARM: Mining by Exploring Vertical
Data Format
 Horizontal data format
 Transaction-id: itemset
 Vertical data format
 Item: set of transaction-ids
 Explained in the next slide

41


CHARM: Mining by Exploring Vertical Data Format (cont.)
[Vertical data format example; figure not preserved in this transcription]
43
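A minimal ECLAT-style sketch in Python (illustrative, and not CHARM itself, which additionally prunes to closed itemsets): itemsets are grown by intersecting tid-sets instead of rescanning the database:

from collections import defaultdict

def eclat(transactions, min_sup):
    # build the vertical format: item -> set of transaction ids
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)
    results = {}

    def grow(prefix, candidates):
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            results[itemset] = len(tids)
            # extend with later items; the intersection is the new tid-set
            ext = []
            for j_item, j_tids in candidates[i + 1:]:
                common = tids & j_tids
                if len(common) >= min_sup:
                    ext.append((j_item, common))
            grow(itemset, ext)

    items = sorted((i, t) for i, t in tidsets.items() if len(t) >= min_sup)
    grow((), items)
    return results

db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(eclat(db, 2))  # same frequent itemsets as the Apriori example, via tid-set intersections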


Chapter 6: Mining Frequent Patterns, Association and
Correlations: Basic Concepts and Methods

 Basic Concepts

 Frequent Itemset Mining Methods

 Which Patterns Are Interesting?—Pattern Evaluation Methods

 Summary

44
Interestingness Measure: Correlations (Lift)
 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%.
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
 Measure of dependent/correlated events: lift

lift(A, B) = P(A ∪ B) / (P(A) × P(B))

              Basketball   Not basketball   Sum (row)
Cereal        2000         1750             3750
Not cereal    1000         250              1250
Sum (col.)    3000         2000             5000

lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

(B = basketball, C = cereal; lift < 1 indicates negative correlation, lift > 1 positive correlation)
45
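The lift computation on the contingency table is a one-liner (a small illustrative helper, not from the slides):

def lift(n_ab, n_a, n_b, n):
    # lift(A, B) = P(A and B) / (P(A) * P(B)), from raw counts
    return (n_ab / n) / ((n_a / n) * (n_b / n))

# basketball (B) vs. cereal (C) contingency table from the slide
print(round(lift(2000, 3000, 3750, 5000), 2))  # lift(B, C) = 0.89 < 1: negatively correlated
print(round(lift(1000, 3000, 1250, 5000), 2))  # lift(B, not C) = 1.33 > 1: positively correlated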
Are lift and χ2 Good Measures of Correlation?

 "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
 Support and confidence are not good indicators of correlation
 Over 20 interestingness measures have been proposed (see Tan, Kumar, Srivastava @KDD'02)
 Which are good ones?

46
Summary

 Basic concepts: association rules, support-confidence framework
 Scalable frequent pattern mining methods
 Apriori (Candidate generation & test)
 Projection-based (FPgrowth, CLOSET+, ...)
 Vertical format approach (ECLAT, CHARM, ...)
 Which patterns are interesting?
 Pattern evaluation methods

47
