5 DM Association Discovery
Pattern Discovery: Definition
• Pattern discovery attempts to discover hidden linkages
between data items.
• Given a set of records, each of which contains some
number of items from a given collection:
– Pattern discovery produces dependency rules that predict the
occurrence of an item based on the occurrences of other items.
• Motivation of pattern discovery: Finding inherent
regularities in data.
− What products were often purchased together?
o Pasta & Tea?
− What are the subsequent purchases after buying a PC?
− What kinds of DNA are sensitive to the new drug D?
− Can we find redundant tests in medicine?
Pattern Discovery: Application
• Shelf management (e.g., supermarkets,
pharmacies, bookshops, etc.)
− Goal: To identify items that are bought together by
sufficiently many customers.
− Approach: Process the collected sales transaction data to
find dependencies among items.
− A classic rule: if a customer buys Coffee and Milk, then
(s)he is very likely to buy Tea. So, don’t be surprised if
you find Tea stacked next to the Coffee!
{Coffee, Milk} → Tea
Prevalent vs. Interesting Rules
• Analysts already know about prevalent rules
– Interesting rules are those that deviate from
prior expectation
• Mining’s payoff is in finding interesting
(surprising) phenomena
• What makes a rule surprising?
– Does not match prior expectation
• Correlation between milk and cereal
remains roughly constant over time
– Cannot be trivially derived from simpler rules
– Milk 10%, cereal 10%
– Milk & cereal 10% … prevailing
– Eggs 10%
– Milk, cereal & eggs 0.1% … surprising!
[Cartoon: 1995: “Milk and eggs sell together!”; 1998: “Milk and cereal sell together!” (analyst: Zzzz…)]
Pattern Discovery: Basic concepts
• itemset: A set of one or more items
• k-itemset: X = {x1, …, xk}
• Support, s: the fraction of transactions that contain X
(i.e., the probability that a transaction contains X)
– For a rule X → Y, support is the fraction of transactions that
contain both X and Y; the rule is kept only if this support
exceeds a user-defined threshold s
– An itemset X is frequent if X’s support is no less than a
minsup threshold
• Confidence: the probability of finding Y in a transaction
containing all of X1, X2, …, Xn
– Confidence, c, is the conditional probability that a transaction
having X also contains Y; the rule is kept only if the conditional
probability of Y given X exceeds a user threshold c
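To make these two measures concrete, here is a minimal Python sketch (the five baskets and the function names are made up for illustration) that computes support and confidence straight from their definitions:

```python
# Hypothetical transaction database: each basket is a set of items.
baskets = [
    {"coffee", "milk", "tea"},
    {"coffee", "milk"},
    {"milk", "bread"},
    {"coffee", "milk", "tea", "sugar"},
    {"bread", "tea"},
]

def support(itemset, baskets):
    """Fraction of baskets containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(X, Y, baskets):
    """Conditional probability of Y given X: support(X and Y) / support(X)."""
    return support(set(X) | set(Y), baskets) / support(X, baskets)

# {coffee, milk} occurs in 3 of 5 baskets -> support 0.6;
# 2 of those 3 baskets also contain tea   -> confidence 2/3.
print(support({"coffee", "milk"}, baskets))              # 0.6
print(confidence({"coffee", "milk"}, {"tea"}, baskets))  # 0.666...
```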
Steps in Pattern Discovery
• Pattern discovery finds itemsets that appear “frequently” in the baskets.
• The problem of pattern discovery can be broken down into
two steps:
1. Finding frequent patterns from large set of items
− Frequent pattern: a pattern (itemsets, subsequences,
substructures, etc.) that occurs frequently in a dataset.
− An itemset is said to be a frequent itemset if its items appear
frequently together in a transaction dataset.
o For example, milk and bread may occur together frequently in a
single transaction and hence form a frequent itemset.
− A subsequence refers to items that occur in transactions in a
sequential order.
o For example, buying a computer at time t0 may be followed by buying
a digital camera at time t1 and a memory card at time t2.
Steps in Pattern Discovery …
− A subsequence that appears frequently is said to be a
frequent subsequence.
− A substructure refers to different structural forms of the
dataset, such as sub-graphs, sub-trees, or sub-lattices,
which may be combined with itemsets or subsequences.
− If a substructure occurs frequently, it is called a (frequent)
structured pattern.
− Finding such frequent patterns plays an essential role in
mining associations, correlations, classification, clustering,
and other data mining tasks as well.
− Thus, frequent pattern mining has become an important
data mining task and a focused theme in data mining
research.
− This chapter is dedicated to methods of frequent itemset
mining.
Steps in Pattern Discovery …
2. Generating association rules from these itemsets.
• Association rules are statements of the form
{X1, X2, …, Xn} → Y, meaning that Y is likely to be present in a
transaction if X1, X2, …, Xn are all in the transaction.
• Example: discovered rules can be:
{Milk} → {Coke}
{Tea, Milk} → {Coke}
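As a sketch of this rule-generation step (Python; the baskets, the chosen frequent itemset, and the min_conf value are all made up), the snippet below takes one frequent itemset F and emits every rule X → F−X whose confidence clears the threshold:

```python
from itertools import combinations

baskets = [{"Milk","Coke"}, {"Tea","Milk","Coke"}, {"Milk","Coke","Bread"},
           {"Tea","Milk"}, {"Milk","Coke"}]
min_conf = 0.8

def count(itemset):
    """Number of baskets containing every item of `itemset`."""
    return sum(itemset <= b for b in baskets)

F = frozenset({"Milk", "Coke"})     # a frequent itemset found in step 1
for r in range(1, len(F)):
    for X in map(frozenset, combinations(F, r)):
        conf = count(F) / count(X)  # conf(X -> F-X) = supp(F) / supp(X)
        if conf >= min_conf:
            print(set(X), "->", set(F - X), f"(conf = {conf:.0%})")
# {'Milk'} -> {'Coke'} (conf = 80%)
# {'Coke'} -> {'Milk'} (conf = 100%)
```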
Example: Finding frequent itemsets
• Given a support threshold S, itemsets that appear in at
least S baskets are called frequent itemsets.
• Example: Frequent Itemsets
– Items bought: {milk, coke, pepsi, biscuit, juice}.
– Support threshold S = 4 baskets.
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
– Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}.
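A brute-force Python sketch that reproduces this result by enumerating every possible itemset; this is only workable on toy data, which is exactly why the smarter algorithms later in the chapter exist:

```python
from itertools import combinations

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]
S = 4   # support threshold, in number of baskets

items = sorted(set().union(*baskets))
# Try every candidate itemset of every size: exponential in the number
# of items in general, hence the need for pruning at scale.
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        n = sum(set(cand) <= b for b in baskets)
        if n >= S:
            print(set(cand), n)
# {'b'} 6, {'c'} 5, {'j'} 4, {'m'} 5, {'b','c'} 4, {'b','m'} 4
```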
Association Rules
• Find all rules on itemsets of the form X→Y with minimum
support and confidence.
– If-then rules about the contents of baskets:
• {i1, i2,…,ik} → j means: “if a basket contains all of i1,…,ik then it
is likely to contain j.”
• A typical question: “find all association rules with support ≥ s
and confidence ≥ c.” Note: “support” of an association rule is the
support of the set of items it mentions.
– The confidence of this association rule is the probability of j given
i1, …, ik: the fraction of the baskets containing all of i1, …, ik that
also contain j.
– Example: Confidence
B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b}
B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
• An association rule: {m, b} → c (with confidence = 2/4 = 50%).
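Using the same eight baskets, this confidence figure can be checked directly; a minimal sketch:

```python
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

n_X  = sum({"m", "b"} <= b for b in baskets)        # baskets with m and b: 4
n_XY = sum({"m", "b", "c"} <= b for b in baskets)   # ... that also have c: 2
print(f"confidence = {n_XY}/{n_X} = {n_XY / n_X:.0%}")   # 2/4 = 50%
```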
Frequent Itemset Mining Methods
• The downward closure property of frequent
patterns:
− Any subset of a frequent itemset must be frequent
− If {Coke, Tea, nuts} is frequent, so is {Coke, Tea}
− i.e., every transaction having {Coke, Tea, nuts} also
contains {Coke, Tea}
• The hardest problem often turns out to be finding
the frequent pairs.
Frequent Itemset Mining Methods
• Apriori: A Candidate Generation-and-Test Approach
– A two-pass approach, called Apriori, limits the need for
main memory.
– Key idea: if a set of items appears at least s times, so does
every subset.
• Contra-positive for pairs: if item i does not appear in s baskets,
then no pair including i can appear in s baskets.
• FPGrowth: A Frequent Pattern-Growth Approach
– Mines frequent patterns without candidate generation
– Still uses the Apriori pruning principle to discard infrequent
items
– Scans the DB only twice!
• Once to find the frequent 1-itemsets (single item patterns)
• Once to construct the FP-tree, the data structure of FPGrowth
• Vertical Data Format
Frequent Itemset Mining Methods …
• Both the Apriori and FP-growth methods mine
frequent patterns from a set of transactions in TID-
itemset format (i.e., {TID: itemset}), where TID is a
transaction ID and itemset is the set of items bought in
transaction TID. This is known as the horizontal data
format.

TID   Itemset
1     {Biscuits, Bread, Cheese, Yogurt, Sugar}
2     {Bread, Cheese, Coffee, Sugar}
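The vertical data format, by contrast, stores the data as {item: TID-set} pairs. A small Python sketch of the conversion, using the two transactions above:

```python
from collections import defaultdict

# Horizontal format: {TID: itemset}, as in the table above.
horizontal = {
    1: {"Biscuits", "Bread", "Cheese", "Yogurt", "Sugar"},
    2: {"Bread", "Cheese", "Coffee", "Sugar"},
}

# Vertical format: {item: set of TIDs whose transaction contains it}.
vertical = defaultdict(set)
for tid, itemset in horizontal.items():
    for item in itemset:
        vertical[item].add(tid)

# The support count of an itemset is then just the size of the
# intersection of its items' TID-sets, with no further DB scans.
print(vertical["Bread"] & vertical["Cheese"])        # {1, 2}
print(len(vertical["Bread"] & vertical["Cheese"]))   # support count = 2
```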
Apriori: A Candidate Generation & Test Approach
• Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested.
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length-k candidate itemsets from the length-(k−1)
frequent itemsets. For each k, we construct two sets of
k-tuples:
• Ck = candidate k-tuples: those that might be frequent sets
(support ≥ s) based on information from the pass for k−1.
• Lk = the set of truly frequent k-tuples.
– Test the candidates against DB
– Terminate when no frequent or candidate set can be
generated.
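A compact sketch of this method (Python; the apriori function name and the toy baskets are my own): pass 1 builds L1, and each later pass joins Lk−1 with itself to form Ck, prunes any candidate with an infrequent (k−1)-subset, and counts the survivors against the DB:

```python
from itertools import combinations

def apriori(baskets, min_count):
    """Return {itemset: count} for all itemsets in >= min_count baskets."""
    # Pass 1: C1 = all items; L1 = the frequent single items.
    counts = {}
    for b in baskets:
        for item in b:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(L)

    k = 2
    while L:
        # Generate Ck by joining L(k-1) with itself ...
        prev = list(L)
        Ck = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                # ... keeping only real k-sets whose every (k-1)-subset
                # is frequent (the Apriori pruning principle).
                if len(union) == k and all(
                        frozenset(sub) in L
                        for sub in combinations(union, k - 1)):
                    Ck.add(union)
        # Test the candidates against the DB: Lk = frequent members of Ck.
        L = {cand: n for cand in Ck
             if (n := sum(cand <= b for b in baskets)) >= min_count}
        frequent.update(L)
        k += 1
    return frequent

baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
           {"m","p","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]
print(apriori(baskets, min_count=4))   # the same six itemsets as before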
A-Priori for All Frequent Itemsets
• C1 = all items; L1 = those items counted as frequent on the
first pass; C2 = pairs both of whose items are in L1; in general,
Ck = k-tuples each of whose (k−1)-subsets are in Lk−1, and
Lk = the members of Ck with support ≥ s.
Bottlenecks of the Apriori approach
• The Apriori algorithm reduces the size of candidate frequent
itemsets by using “Apriori property” - all nonempty subsets of
a frequent itemset must also be frequent.
• However, it still requires two nontrivial computationally
expensive processes.
• It requires as many database scans as the size of the largest
frequent itemset: to find frequent k-itemsets, the Apriori
algorithm needs to scan the database k times.
• Breadth-first (i.e., level-wise) search
– Candidate generation, followed by testing the true frequency of
the candidate itemsets against the database.
– It may generate a huge number of candidate sets that will be
discarded later in the test stage.
Pattern-Growth Approach
• The FPGrowth Approach
– Depth-first search: searches depth-wise, extending a given
single item or pair of items into longer and longer combinations.
– Avoids explicit candidate generation; instead, it grows frequent
itemsets directly.
• Major philosophy: Grow long patterns from short ones using
local frequent items only.
– “abc” is a frequent pattern
– Get all transactions having “abc”, i.e., project DB on abc:
DB|abc
– “d” is a local frequent item in DB|abc → abcd is a frequent
pattern
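The grow-long-patterns-from-short-ones idea can be sketched without the FP-tree itself by recursing on projected databases directly. This is a simplified stand-in for FP-Growth (the same divide-and-conquer idea, but on plain transaction lists rather than a compressed tree), with made-up toy baskets:

```python
def pattern_growth(db, min_count, suffix=frozenset(), out=None):
    """Grow frequent patterns recursively via projected databases."""
    if out is None:
        out = {}
    # Count local item frequencies in the (projected) database.
    counts = {}
    for t in db:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    # Fix an order over the locally frequent items so that each
    # pattern is generated exactly once.
    freq = sorted(item for item, c in counts.items() if c >= min_count)
    for i, item in enumerate(freq):
        pattern = suffix | {item}
        out[pattern] = counts[item]
        # DB|item: transactions containing `item`, restricted to the
        # locally frequent items that come after it in the order.
        later = set(freq[i + 1:])
        projected = [t & later for t in db if item in t]
        pattern_growth(projected, min_count, pattern, out)
    return out

baskets = [{"a","b","c"}, {"a","b","c","d"}, {"a","c","d"}, {"b","d"}]
print(pattern_growth(baskets, min_count=2))
# e.g. {a}:3, {a,b}:2, {a,b,c}:2, {a,c}:3, {a,c,d}:2, {a,d}:2, ...
```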
Construct FP-tree from a Transaction Database
Assume min_support = 3 and min_confidence = 80%
TID    Items bought                   (Ordered) frequent items
100    {f, a, c, d, g, i, m, p}       {f, c, a, m, p}
200    {a, b, c, f, l, m, o}          {f, c, a, b, m}
300    {b, f, h, j, o, w}             {f, b}
400    {b, c, k, s, p}                {c, b, p}
500    {a, f, c, e, l, p, m, n}       {f, c, a, m, p}

1. Scan the DB once, find the frequent 1-itemsets (single item patterns)
2. Sort the frequent items in frequency-descending order: the f-list
   F-list = f-c-a-b-m-p
3. Scan the DB again, construct the FP-tree

Header Table (each head pointer links to that item’s nodes in the tree):
Item   frequency
f      4
c      4
a      3
b      3
m      3
p      3

FP-tree (node:count; transactions sharing a prefix share a path):
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
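A Python sketch of steps 1–3 for the five transactions above; the Node class and the header table kept as plain lists of nodes are simplifications of this sketch (a real FP-tree threads node-links through the tree):

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                 # item -> child Node

transactions = [
    ["f","a","c","d","g","i","m","p"],
    ["a","b","c","f","l","m","o"],
    ["b","f","h","j","o","w"],
    ["b","c","k","s","p"],
    ["a","f","c","e","l","p","m","n"],
]
min_support = 3

# Step 1: scan the DB once and find the frequent single items.
counts = Counter(item for t in transactions for item in t)
freq = {item: c for item, c in counts.items() if c >= min_support}

# Step 2: the f-list -- frequent items in frequency-descending order
# (ties among a, b, m, p may come out in a different order than the
# slide's f-c-a-b-m-p, which does not affect correctness).
flist = sorted(freq, key=lambda item: -freq[item])

# Step 3: scan the DB again; insert each transaction's frequent items,
# ordered by the f-list, into the tree so common prefixes are shared.
root = Node(None, None)
header = {item: [] for item in flist}      # item -> its nodes in the tree
for t in transactions:
    node = root
    for item in [i for i in flist if i in t]:
        if item not in node.children:
            node.children[item] = Node(item, node)
            header[item].append(node.children[item])
        node = node.children[item]
        node.count += 1

def prefix_paths(item):
    """Prefix paths of `item`: its conditional pattern base."""
    for node in header[item]:
        path, p = [], node.parent
        while p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            yield list(reversed(path)), node.count

print(list(prefix_paths("p")))   # e.g. [(['f','c','a','m'], 2), (['c','b'], 1)]
```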
FP-Growth Example
• Construct the conditional pattern base, which consists of the set of
prefix paths in the FP-tree co-occurring with the suffix pattern, and
then construct its conditional FP-tree.
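Worked through for the FP-tree on the previous slide (min_support = 3), the conditional pattern bases and the frequent patterns they yield are:
− p: base {(f c a m : 2), (c b : 1)} → conditional FP-tree ⟨c:3⟩ → pattern cp:3
− m: base {(f c a : 2), (f c a b : 1)} → conditional FP-tree ⟨f:3, c:3, a:3⟩ → patterns fm, cm, am, fcm, fam, cam, fcam (all :3)
− b: base {(f c a : 1), (f : 1), (c : 1)} → no item reaches min_support, so no longer pattern ends in b
− a: base {(f c : 3)} → conditional FP-tree ⟨f:3, c:3⟩ → patterns fa, ca, fca (all :3)
− c: base {(f : 3)} → conditional FP-tree ⟨f:3⟩ → pattern fc:3
− f: empty conditional pattern base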
Exercise
• The data below is a hypothetical dataset of transactions,
with each letter representing an item. Let min_support = 3
and min_confidence = 80%.
Project (Due: ________)
• Requirement: what you need to do for this project is:
− Choose a dataset with 10+ attributes and at least 1000 instances.
As much as possible, try to use local data to make the analysis
easy; otherwise go to the URL: http://www.kdnuggets.com/datasets/
− Preprocess the dataset if there are any incomplete data, missing
values, outliers, or unbalanced classes.
− Choose at least two algorithms for classification, clustering, or
association rule discovery that are implemented in Weka.
− Run the chosen algorithms on the selected and prepared dataset.
• Project Report - Write a publishable report with the following
sections:
− Introduction (the problem, objective & methodology of the
study)
− Review of related works
− Data preparation
− Experimental setup (mining method & parameters used for the
experiment)
− Summary of experimental result & findings of the study
− Concluding remarks
− References