Mining Frequent Pattern

Frequent Pattern Analysis involves identifying patterns that occur frequently within datasets, with applications in various fields such as market basket analysis and DNA sequence analysis. Association Rule Mining is a key technique used to discover relationships between items in transactions, focusing on support and confidence metrics to evaluate the strength of these rules. The document discusses the importance of frequent pattern mining, various interpretations of transaction data, and the methodologies for mining association rules, including the Apriori algorithm.


Mining Frequent Pattern

Asma Kanwal
Lecturer
What Is Frequent Pattern Analysis?

• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
• Motivation: finding inherent regularities in data
  – What products were often purchased together? Beer and diapers?!
  – What are the subsequent purchases after buying a PC?
  – What kinds of DNA are sensitive to this new drug?
  – Can we automatically classify web documents?
• Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Why Is Frequent Pattern Mining Important?

• Discloses an intrinsic and important property of data sets
• Forms the foundation for many essential data mining tasks:
  – Association, correlation, and causality analysis
  – Sequential and structural (e.g., sub-graph) patterns
  – Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  – Classification: associative classification
  – Cluster analysis: frequent pattern-based clustering
  – Data warehousing: iceberg cube and cube-gradient
  – Semantic data compression: fascicles
• Broad applications
Association Rule Mining

• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction

Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Transaction data can be broadly interpreted I: a set of documents

• A text document data set. Each document is treated as a “bag” of keywords. Note: text is ordered, but bags of words are not.

doc1: Student, Teach, School
doc2: Student, School
doc3: Teach, School, City, Game
doc4: Baseball, Basketball
doc5: Basketball, Player, Spectator
doc6: Baseball, Coach, Game, Team
doc7: Basketball, Team, City, Game

Example of association rules:
{Student} → {School}
{data} → {mining}
{Baseball} → {ball}
Transaction data can be broadly interpreted II: a set of genes

ID  Expressed Genes in Sample
1   GENE1, GENE2, GENE5
2   GENE1, GENE3, GENE5
3   GENE2
4   GENE8, GENE9
5   GENE8, GENE9, GENE10
6   GENE2, GENE8
7   GENE9, GENE10
8   GENE2
9   GENE11

Example of association rules:
{GENE1} → {GENE12}
{GENE3, GENE12} → {GENE3}
Transaction data can be broadly interpreted III: a set of time series patterns

[Figure: four time series (rows 1–4) over the interval 0 to 180, annotated with recurring motifs A, B, C, and D]

Example of association rules:
{A} → {B}
Use of Association Rules

• Association rules do not represent any sort of causality or correlation between the two itemsets:
  – X → Y does not mean X causes Y, so no causality
  – X → Y can be different from Y → X, unlike correlation
• Association rule types:
  – Actionable rules: contain high-quality, actionable information
  – Trivial rules: information already well known by those familiar with the domain
  – Inexplicable rules: have no explanation and do not suggest action
• Trivial and inexplicable rules occur most often
The Ideal Association Rule

• Imagine that we have a large transaction dataset of patient symptoms and interventions (including drugs taken).
• We run our algorithm and it gives a rule that reads:

{warfarin, levofloxacin} → {nose bleeds}

• Then we have automatically discovered a dangerous drug interaction. Both warfarin and levofloxacin are useful drugs by themselves, but together they are dangerous: patterns of bruises, and signs of an active bleed such as coughing up blood in the form of coffee grounds (hemoptysis), gingival bleeding, nose bleeds, and so on.
Intuitive Association Rules

• In the music recommendation domain:
{purchased(beatles LP)} → {purchased(the kinks LP)}
• These kinds of rules are very exploitable in e-commerce.
Definition: Frequent Itemset

• Itemset
  – A collection of one or more items
  – Example: {Milk, Bread, Diaper}
• k-itemset
  – An itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g., σ({Milk, Bread, Diaper}) = 2
• Support, s (ranges from 0 to 1)
  – Fraction of transactions that contain an itemset
  – E.g., s({Milk, Bread, Diaper}) = 2/5
• Frequent itemset
  – An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Beer, Diaper
5    Bread, Milk, Diaper, Coke
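The support-count and support definitions above can be written directly in Python. A minimal sketch using the slide's five market-basket transactions; the function names are my own:

```python
# Toy market-basket data from the slide (TIDs 1-5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Beer", "Diaper"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item in X."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """s(X): fraction of transactions containing X (between 0 and 1)."""
    return support_count(itemset, transactions) / len(transactions)

X = {"Milk", "Bread", "Diaper"}
print(support_count(X, transactions))   # 2
print(support(X, transactions))         # 0.4
print(support(X, transactions) >= 0.5)  # False: not frequent at minsup = 0.5
```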
Definition: Association Rule

• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets*
  – Example: {Milk, Diaper} → {Beer}

• Important note
  – Association rules do not consider order, so {Milk, Diaper} → {Beer} and {Diaper, Milk} → {Beer} are the same rule

• Rule evaluation metrics
  – Support (s)
    • Fraction of transactions that contain both X and Y
  – Confidence (c)
    • Measures how often items in Y appear in transactions that contain X

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example: {Milk, Diaper} → {Beer}

s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 ≈ 0.67

*X and Y are disjoint
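The two metrics can be checked numerically against the table above. A small sketch; the helper name is mine:

```python
# The five market-basket transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Support count: number of transactions containing all of itemset."""
    return sum(1 for t in transactions if itemset <= t)

# Rule {Milk, Diaper} -> {Beer}:
s = sigma({"Milk", "Diaper", "Beer"}) / len(transactions)          # 2/5
c = sigma({"Milk", "Diaper", "Beer"}) / sigma({"Milk", "Diaper"})  # 2/3
print(round(s, 2), round(c, 2))  # 0.4 0.67
```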
Association Rules

• Why measure support?
  – Very low support rules can happen by chance
  – Even if true, low support rules are often not actionable
• Why measure confidence?
  – Very low confidence rules are not reliable
Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold (provided by user)
  – confidence ≥ minconf threshold (provided by user)
• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds
  – Computationally prohibitive!
Mining Association Rules

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of rules:
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we can decouple the support and confidence requirements
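The observation that one frequent itemset yields many rules with identical support can be reproduced by enumerating all binary partitions. An illustrative sketch, not taken from the slides:

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    return sum(1 for t in transactions if itemset <= t)

itemset = frozenset({"Milk", "Diaper", "Beer"})
s = sigma(itemset) / len(transactions)  # same support for every rule: 0.4

# Every non-empty proper subset X gives one rule X -> (itemset - X).
rules = []
for r in range(1, len(itemset)):
    for lhs in combinations(sorted(itemset), r):
        X = frozenset(lhs)
        c = sigma(itemset) / sigma(X)  # confidence differs per rule
        rules.append((set(X), set(itemset - X), s, round(c, 2)))

for rule in rules:
    print(rule)
```

All six rules share s = 0.4, while confidence ranges from 0.5 to 1.0, matching the slide's table.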
Mining Association Rules

• Two-step approach:
  1. Frequent itemset generation
     – Generate all itemsets whose support ≥ minsup
  2. Rule generation
     – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive
The problem with association rules

• How do we set support and confidence?
• We tend to either find no rules, or a few million
• Given we find a few million, we can rank them using some ranking function; there are lots of measures proposed in the literature
Basic Concepts: Frequent Patterns and Association Rules

Transaction-id  Items bought
10              A, B, D
20              A, C, D
30              A, D, E
40              B, E, F
50              B, C, D, E, F

• Itemset X = {x1, …, xk}
• Find all the rules X → Y with minimum support and confidence
  – support, s: probability that a transaction contains X ∪ Y
  – confidence, c: conditional probability that a transaction having X also contains Y

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

Let supmin = 50%, confmin = 50%

Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules:
A → D (60%, 100%)
D → A (60%, 75%)
Closed Patterns and Max-Patterns

• A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
• Solution: mine closed patterns and max-patterns instead
• An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
• An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
• A closed pattern is a lossless compression of frequent patterns
  – Reduces the number of patterns and rules
Closed Patterns and Max-Patterns

• Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
  – Min_sup = 1
• What is the set of closed itemsets?
  – <a1, …, a100>: 1
  – <a1, …, a50>: 2
• What is the set of max-patterns?
  – <a1, …, a100>: 1
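The closed/max definitions in the exercise can be verified on a scaled-down version of the same DB (two transactions over a1..a4 instead of a1..a100, so the enumeration stays small). An illustrative sketch under that assumption:

```python
from itertools import combinations

# Scaled-down DB in the spirit of the exercise: <a1..a4> and <a1..a2>.
db = [frozenset({"a1", "a2", "a3", "a4"}), frozenset({"a1", "a2"})]
min_sup = 1

def sigma(X):
    return sum(1 for t in db if X <= t)

items = sorted(set().union(*db))
frequent = {frozenset(c): sigma(frozenset(c))
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if sigma(frozenset(c)) >= min_sup}

# Closed: no proper superset with the same support.
closed = [X for X, s in frequent.items()
          if not any(X < Y and sup == s for Y, sup in frequent.items())]
# Maximal: no frequent proper superset at all.
maximal = [X for X in frequent if not any(X < Y for Y in frequent)]

print(sorted(map(sorted, closed)))   # [['a1', 'a2'], ['a1', 'a2', 'a3', 'a4']]
print(sorted(map(sorted, maximal)))  # [['a1', 'a2', 'a3', 'a4']]
```

As in the exercise, the closed set keeps both patterns (with their supports), while the max-pattern set keeps only the longest one.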
Scalable Methods for Mining Frequent Patterns

• The downward closure property of frequent patterns
  – Any subset of a frequent itemset must be frequent
  – If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  – i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
• Scalable mining methods: three major approaches
  – Apriori
  – Frequent pattern growth
  – Vertical data format approach
Apriori: A Candidate Generation-and-Test Approach

• Apriori pruning principle: if there is any itemset which is infrequent, its supersets should not be generated/tested!
• Method:
  – Initially, scan the DB once to get the frequent 1-itemsets
  – Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  – Test the candidates against the DB
  – Terminate when no frequent or candidate set can be generated
The Apriori Algorithm: An Example

Supmin = 2

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:          L1:
Itemset  sup            Itemset  sup
{A}      2              {A}      2
{B}      3              {B}      3
{C}      3              {C}      3
{D}      1              {E}      3
{E}      3

C2 (from L1), 2nd scan: L2:
Itemset  sup            Itemset  sup
{A, B}   1              {A, C}   2
{A, C}   2              {B, C}   2
{A, E}   1              {B, E}   3
{B, C}   2              {C, E}   2
{B, E}   3
{C, E}   2

C3: {B, C, E}; 3rd scan → L3:
Itemset    sup
{B, C, E}  2
The Apriori Algorithm

• Pseudo-code:
  Ck: candidate itemset of size k
  Lk: frequent itemset of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
Important Details of Apriori

• How to generate candidates?
  – Step 1: self-join Lk
  – Step 2: pruning
• How to count supports of candidates?
• Example of candidate generation:
  – L3 = {abc, abd, acd, ace, bcd}
  – Self-join: L3*L3
    • abcd from abc and abd
    • acde from acd and ace
  – Pruning:
    • acde is removed because ade is not in L3
  – C4 = {abcd}
How to Generate Candidates?

• Suppose the items in Lk-1 are listed in an order
• Step 1: self-join Lk-1

  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1 < q.itemk-1

• Step 2: pruning

  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck
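The join-and-prune steps above can be written out in Python for the L3 example from the previous slide. An illustrative sketch:

```python
from itertools import combinations

def gen_candidates(Lk_minus_1):
    """Join two (k-1)-itemsets sharing their first k-2 items, then prune."""
    Lk_minus_1 = sorted(tuple(sorted(x)) for x in Lk_minus_1)
    Lset = set(Lk_minus_1)
    k = len(Lk_minus_1[0]) + 1
    cands = set()
    for p in Lk_minus_1:
        for q in Lk_minus_1:
            # Self-join: first k-2 items equal, last item of p < last of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune: all (k-1)-subsets must be in L(k-1).
                if all(s in Lset for s in combinations(c, k - 1)):
                    cands.add(c)
    return cands

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")]
print(gen_candidates(L3))  # {('a', 'b', 'c', 'd')}
```

The join produces abcd and acde, and pruning then removes acde because ade is not in L3, leaving C4 = {abcd} exactly as on the slide.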


How to Count Supports of Candidates?

• Why is counting supports of candidates a problem?
  – The total number of candidates can be very huge
  – One transaction may contain many candidates
• Method:
  – Candidate itemsets are stored in a hash-tree
  – A leaf node of the hash-tree contains a list of itemsets and counts
  – An interior node contains a hash table
  – Subset function: finds all the candidates contained in a transaction
Example: Counting Supports of Candidates

[Figure: a hash-tree over 3-itemset candidates (145, 124, 457, 125, 458, 159, 345, 356, 357, 689, 367, 368, 136, 234, 567). Each interior node hashes an item into one of three branches (1,4,7 / 2,5,8 / 3,6,9). The subset function walks the tree for transaction {1 2 3 5 6}, expanding prefixes 1+2356, 12+356, 13+56, etc., to find all candidates contained in the transaction.]
Challenges of Frequent Pattern Mining

• Challenges:
  – Multiple scans of the transaction database
  – Huge number of candidates
  – Tedious workload of support counting for candidates
• Improving Apriori: general ideas
  – Reduce passes of transaction database scans
  – Shrink the number of candidates
  – Facilitate support counting of candidates
Partition: Scan Database Only Twice

• Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  – Scan 1: partition the database and find local frequent patterns
  – Scan 2: consolidate global frequent patterns
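A minimal sketch of the two-scan partition idea. The local miner here is brute-force enumeration standing in for Apriori on each partition, and all names and the toy data are my own:

```python
from itertools import combinations

def local_frequent(part, min_sup_frac):
    """Brute-force local miner (stands in for Apriori on one partition)."""
    items = sorted({i for t in part for i in t})
    out = set()
    for r in range(1, len(items) + 1):
        for c in combinations(items, r):
            X = frozenset(c)
            if sum(X <= t for t in part) >= min_sup_frac * len(part):
                out.add(X)
    return out

def partition_mine(transactions, n_parts, min_sup_frac):
    transactions = [frozenset(t) for t in transactions]
    size = -(-len(transactions) // n_parts)  # ceiling division
    parts = [transactions[i:i + size] for i in range(0, len(transactions), size)]
    # Scan 1: the union of locally frequent itemsets is a complete
    # candidate set, since a globally frequent itemset must be locally
    # frequent in at least one partition.
    cands = set().union(*(local_frequent(p, min_sup_frac) for p in parts))
    # Scan 2: one pass over the whole DB keeps the truly frequent ones.
    return {X for X in cands
            if sum(X <= t for t in transactions)
               >= min_sup_frac * len(transactions)}

result = partition_mine([{"A", "B"}, {"A", "C"}, {"A", "B"}, {"B", "C"}],
                        n_parts=2, min_sup_frac=0.5)
print(sorted(sorted(x) for x in result))  # [['A'], ['A', 'B'], ['B'], ['C']]
```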
Reduce the Number of Candidates

• A k-itemset whose corresponding hash-bucket count is below the threshold cannot be frequent
  – Candidates: a, b, c, d, e
  – Hash entries: {ab, ad, ae}, {bd, be, de}, …
  – Frequent 1-itemsets: a, b, d, e
  – ab is not a candidate 2-itemset if the sum of counts of {ab, ad, ae} is below the support threshold
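The hash-based pruning can be sketched as follows; the hash function and the toy data are my own invention for illustration:

```python
from itertools import combinations

# Hypothetical toy data; min_sup = 2.
transactions = [{"a", "b"}, {"a", "b"}, {"c", "d"}, {"a", "c"}]
min_sup = 2

# While scanning for frequent 1-itemsets, also hash every 2-itemset of
# each transaction into a small bucket table and count the hits.
n_buckets = 3
bucket = [0] * n_buckets

def h(pair):
    # Illustrative hash: sum of character codes mod number of buckets.
    return sum(ord(i) for i in pair) % n_buckets

for t in transactions:
    for pair in combinations(sorted(t), 2):
        bucket[h(pair)] += 1

# A 2-itemset can be frequent only if its bucket count >= min_sup, so
# pairs falling into light buckets are never generated as candidates.
items = sorted({i for t in transactions for i in t})
cands = [p for p in combinations(items, 2) if bucket[h(p)] >= min_sup]
print(cands)  # [('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd')]
```

Note the filter is safe but not exact: an infrequent pair can survive by sharing a heavy bucket, but no frequent pair is ever pruned.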
Sampling for Frequent Patterns

• Select a sample of the original database, and mine frequent patterns within the sample using Apriori
• Scan the database once to verify frequent itemsets found in the sample; only borders of the closure of frequent patterns are checked
  – Example: check abcd instead of ab, ac, …, etc.
• Scan the database again to find missed frequent patterns
