
Pattern (or Association Rule) Discovery
Pattern Discovery: Definition
• Pattern discovery attempts to discover hidden linkages
between data items.
• Given a set of records, each of which contains some
number of items from a given collection,
– pattern discovery produces dependency rules that predict the
occurrence of an item based on the occurrences of other items.
• Motivation of pattern discovery: finding inherent
regularities in data.
− What products were often purchased together?
o Pasta & Tea?
− What are the subsequent purchases after buying a PC?
− What kinds of DNA are sensitive to the new drug D?
− Can we find redundant tests in medicine?
Pattern Discovery: Application
• Shelf management (e.g. supermarket, pharmacy, book shop, etc.)
− Goal: to identify items that are bought together by
sufficiently many customers.
− Approach: process the collected sales transaction data to
find dependencies among items.
− A classic rule -- if a customer buys Coffee and Milk, then
(s)he is very likely to buy Tea. So, don't be surprised if
you find the tea stacked next to the coffee!
{Coffee, Milk} → Tea
Prevalent ≠ Interesting Rules
• Analysts already know about prevalent rules
– Interesting rules are those that deviate from
prior expectation
• Mining's payoff is in finding interesting
(surprising) phenomena
• What makes a rule surprising?
– Does not match prior expectation
• Correlation between milk and cereal
remains roughly constant over time
– Cannot be trivially derived from simpler rules
– Milk 10%, cereal 10%
– Milk & cereal 10% … prevailing
– Eggs 10%
– Milk, cereal & eggs 0.1% … surprising!
[Figure: 1995: "Milk and eggs sell together!"; 1998: "Milk and cereal sell together!" (Zzzz...)]
Pattern Discovery: Basic concepts
• Itemset: a set of one or more items
• k-itemset: X = {x1, …, xk}
• Support, s, is the fraction of transactions that contain
X (i.e., the probability that a transaction contains X)
– for a rule X → Y, its support is the fraction of transactions
that contain both X and Y, and it must be no less than a
user-defined threshold s
– an itemset X is frequent if X's support is no less than a
minsup threshold
• Confidence, c, is the probability of finding Y in a
transaction that contains all of X1, X2, …, Xn
– i.e., the conditional probability that a transaction having X
also contains Y; it must be no less than a user-defined
threshold c
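In symbols (with N the total number of transactions), the two measures above can be written as:

support(X → Y) = P(X ∪ Y) = count(transactions containing X ∪ Y) / N
confidence(X → Y) = P(Y | X) = support(X ∪ Y) / support(X)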
Steps in Pattern Discovery
• It finds itemsets that appear "frequently" in the baskets.
• The problem of pattern discovery can be divided into
two steps:
1. Finding frequent patterns from a large set of items
− Frequent pattern: a pattern (itemsets, subsequences,
substructures, etc.) that occurs frequently in a dataset.
− An itemset is called a frequent itemset if its items
appear together frequently in a transaction dataset.
o For example, milk and bread may occur together frequently in
transactions and hence form a frequent itemset.
− A subsequence refers to items that occur in transactions in a
sequential order.
o For example, buying a computer at time t0 may be followed by buying
a digital camera at time t1, and a memory card at time t2.
Steps in Pattern Discovery …
− A subsequence that appears frequently is called a
frequent subsequence.
− A substructure refers to different structural forms of the
dataset, such as sub-graphs, sub-trees, or sub-lattices,
which may be combined with itemsets or subsequences.
− If a substructure occurs frequently, it is called a (frequent)
structured pattern.
− Finding such frequent patterns plays an essential role in
mining associations, correlations, classification, clustering,
and other data mining tasks.
− Thus, frequent pattern mining has become an important
data mining task and a focused theme in data mining
research.
− This chapter is dedicated to methods of frequent itemset
mining.
Steps in Pattern Discovery …
2. Generating association rules from these itemsets.
• Association rules are statements of the form
{X1, X2, …, Xn} → Y, which means that Y is likely to be present
in a transaction if X1, X2, …, Xn are all in that transaction.
• Example: rules discovered can be
{Milk} → {Coke}
{Tea, Milk} → {Coke}
Example: Finding frequent itemsets
• Given a support threshold S, sets of items that
appear in at least S baskets are called frequent itemsets.
• Example: Frequent Itemsets
– Items bought = {milk, coke, pepsi, biscuit, juice}.
– Support threshold = 4 baskets.
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
– Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}.
Association Rules
• Find all rules on itemsets of the form X → Y with minimum
support and confidence.
– If-then rules about the contents of baskets:
• {i1, i2, …, ik} → j means: "if a basket contains all of i1, …, ik then it
is likely to contain j."
• A typical question: "find all association rules with support ≥ s
and confidence ≥ c." Note: the "support" of an association rule is the
support of the set of items it mentions.
– The confidence of this association rule is the probability of j given
i1, …, ik, i.e. the fraction of the baskets containing i1, …, ik that
also contain j.
– Example: Confidence
B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b}
B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
• An association rule: {m, b} → c (with confidence = 2/4 = 50%).
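The counts behind this example can be verified with a small Python sketch (an illustration, not part of the original slides):

from itertools import combinations

# Baskets B1-B8 from the example above
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

def support_count(itemset):
    """Number of baskets containing every item of the itemset."""
    return sum(itemset <= basket for basket in baskets)

# Frequent itemsets of size 1 and 2 at support threshold s = 4
items = set().union(*baskets)
for k in (1, 2):
    for cand in combinations(sorted(items), k):
        if support_count(set(cand)) >= 4:
            print(set(cand))

# Confidence of the rule {m, b} -> c
print(support_count({"m", "b", "c"}) / support_count({"m", "b"}))  # 2/4 = 0.5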
Frequent Itemset Mining Methods
• The downward closure property of frequent
patterns:
− Any subset of a frequent itemset must be frequent
− If {Coke, Tea, nuts} is frequent, so is {Coke, Tea}
− i.e., every transaction having {Coke, Tea, nuts} also
contains {Coke, Tea}
• The hardest problem often turns out to be finding
the frequent pairs.

Frequent Itemset Mining Methods
• Apriori: a candidate generation-and-test approach
– A two-pass approach called A-Priori limits the need for
main memory.
– Key idea: if a set of items appears at least s times, so does
every subset of it.
• Contrapositive for pairs: if item i does not appear in s baskets,
then no pair including i can appear in s baskets.
• FP-growth: a frequent pattern-growth approach
– Mines frequent patterns without candidate generation.
– Uses the Apriori pruning principle only to discard infrequent
items, not to generate candidate itemsets.
– Scans the DB only twice!
• Once to find the frequent 1-itemsets (single-item patterns)
• Once to construct the FP-tree, the data structure of FP-growth
• Vertical data format (mining with item-TID_set data; see the next slide)
Frequent Itemset Mining Methods …
• Both the Apriori and FP-growth methods mine
frequent patterns from a set of transactions in TID-
itemset format (i.e., {TID: itemset}), where TID is a
transaction ID and itemset is the set of items bought in
transaction TID. This is known as the horizontal data
format.

  TID   Itemset
  1     {Biscuits, Bread, Cheese, Yogurt, Sugar}
  2     {Bread, Cheese, Coffee, Sugar}

• Alternatively, data can be presented in item-TID_set
format (i.e., {itemset: TID_set}), where itemset is an item (or
set of items) and TID_set is the set of identifiers of the
transactions containing it. This is known as the vertical data
format.

  Itemset            TID_set
  {Bread, Cheese}    {1, 2, 4}
  {Cheese, Sugar}    {1, 2, 3}
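As a small illustration (a sketch, not from the slides), converting the horizontal {TID: itemset} representation into the vertical {item: TID_set} representation takes one pass over the transactions:

from collections import defaultdict

# Horizontal format: TID -> itemset (the two transactions shown above)
horizontal = {
    1: {"Biscuits", "Bread", "Cheese", "Yogurt", "Sugar"},
    2: {"Bread", "Cheese", "Coffee", "Sugar"},
}

# Vertical format: item -> set of TIDs whose transactions contain it
vertical = defaultdict(set)
for tid, itemset in horizontal.items():
    for item in itemset:
        vertical[item].add(tid)

# The support count of an itemset is the size of the intersection of its TID sets,
# e.g. support({Bread, Cheese}) = |vertical["Bread"] & vertical["Cheese"]|
print(len(vertical["Bread"] & vertical["Cheese"]))  # 2 on these two transactions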
A-Priori Algorithm
• Pass 1: Read the baskets and count in main memory
the occurrences of each item.
– Requires memory proportional only to the number of
items.
• Pass 2: Read the baskets again and count in main
memory only those pairs both of whose items were found
in Pass 1 to be frequent.
– Requires memory proportional to the square of the
number of frequent items only.
[Figure: Pass 1 holds the item counts and the resulting frequent items;
Pass 2 holds the counts of the candidate pairs.]
Apriori: A Candidate Generation & Test Approach
• Apriori pruning principle: if any itemset is
infrequent, its supersets should not be generated/tested.
• Method:
– Initially, scan the DB once to get the frequent 1-itemsets.
– Generate length-(k+1) candidate itemsets from the length-k
frequent itemsets. For each k, we construct two sets of
k-tuples:
• Ck = candidate k-tuples = those that might be frequent sets
(support ≥ s) based on information from the pass for k−1.
• Lk = the set of truly frequent k-tuples.
– Test the candidates against the DB.
– Terminate when no frequent or candidate set can be
generated.
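A compact Python sketch of this candidate-generation-and-test loop (an illustration under the definitions above, not the slides' own pseudocode); it is run here on the transaction database of the worked example that follows:

from itertools import combinations

def apriori(transactions, min_count):
    # L1: frequent 1-itemsets
    items = {item for t in transactions for item in t}
    L = {frozenset([i]) for i in items
         if sum(i in t for t in transactions) >= min_count}
    frequent, k = set(L), 2
    while L:
        # Candidate generation: join L(k-1) with itself, keeping sets of size k
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must itself be frequent
        C = {c for c in C
             if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # Test: count the support of the surviving candidates in the DB
        L = {c for c in C
             if sum(c <= t for t in transactions) >= min_count}
        frequent |= L
        k += 1
    return frequent

# Transaction database of the example below; min support 50% of 4 transactions = 2
db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(sorted(sorted(s) for s in apriori(db, min_count=2)))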
A-Priori for All Frequent Itemsets
• C1 = all items; L1 = those items counted on the first pass
to be frequent; C2 = pairs of items, both of which are in L1.
In general, Ck = k-tuples, each (k−1)-subset of which is in
Lk−1; Lk = the members of Ck with support ≥ s.
[Figure: C1 (all items) → count → filter → L1 (frequent items) →
construct → C2 (all pairs of items from L1) → count → filter → L2 →
construct → C3 → …; C1 is counted on the first pass and C2 on the
second pass.]


The Apriori Algorithm—An Example
Assume that min support = 50% and min confidence = 80%; identify the
frequent itemsets and construct the association rules.

Database TDB:
  Tid   Items
  10    A, C, D
  20    B, C, E
  30    A, B, C, E
  40    B, E

C1 (1st scan):  {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1:             {A}:2, {B}:3, {C}:3, {E}:3

C2 (generated from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
C2 (2nd scan):  {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2:             {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 candidates:  {A,B,C}, {A,B,E}, {A,C,E}, {B,C,E}; only {B,C,E} survives
                the Apriori pruning (e.g., {A,B} and {A,E} are not in L2)
L3 (3rd scan):  {B,C,E}:2
Which of the itemsets above yield rules that fulfill a confidence
level of at least 80%?

  Rule        Support   Confidence
  A → C       50%       100%
  B → C       50%       66.67%
  B → E       75%       100%
  C → E       50%       66.67%
  (B,C) → E   50%       100%
  (B,E) → C   50%       66.67%

Results:
  A → C (with support 50%, confidence 100%)
  B → E (with support 75%, confidence 100%)
  (B,C) → E (with support 50%, confidence 100%)
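If the third-party mlxtend library happens to be available (an assumption, not something the slides require), the same example can be reproduced as a quick sanity check; the column names below are mlxtend's own:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# The four transactions of the worked example above
transactions = [["A", "C", "D"], ["B", "C", "E"], ["A", "B", "C", "E"], ["B", "E"]]

# One-hot encode the transactions, then run Apriori with min support 50%
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
frequent = apriori(onehot, min_support=0.5, use_colnames=True)

# Keep only the rules with confidence >= 80%
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence"]])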
Exercise
• The ‘database’ below has four transactions -
what association rules can be found in this
set, if the minimum support is 60% and the
minimum confidence is 80%?
Tid Items
100 K, A, D, B
200 D, A, C, E, B
300 C, A, B, E
400 B, A, D

Bottlenecks of the Apriori approach
• The Apriori algorithm reduces the size of the candidate frequent
itemsets by using the Apriori property: all nonempty subsets of
a frequent itemset must also be frequent.
• However, it still involves two nontrivial, computationally
expensive processes:
– It requires as many database scans as the size of the largest
frequent itemset; to find the frequent k-itemsets, the
Apriori algorithm needs to scan the database k times.
– Breadth-first (i.e., level-wise) search: candidate generation
followed by testing the candidates' true frequency of
appearance. It may generate a huge number of candidate sets
that are later discarded in the test stage.
Pattern-Growth Approach
• The FP-growth approach
– Depth-first search: searches depth-wise by extending different
combinations starting from a single item or pair of items.
– Avoids explicit candidate generation; instead it grows
frequent itemsets directly.
• Major philosophy: grow long patterns from short ones using
local frequent items only.
– "abc" is a frequent pattern
– Get all transactions having "abc", i.e., project the DB on abc:
DB|abc
– "d" is a local frequent item in DB|abc → abcd is a frequent
pattern
Construct FP-tree from a Transaction Database
Assume min_support = 3 and min_confidence = 80%.

  TID   Items bought               (ordered) frequent items
  100   {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
  200   {a, b, c, f, l, m, o}      {f, c, a, b, m}
  300   {b, f, h, j, o, w}         {f, b}
  400   {b, c, k, s, p}            {c, b, p}
  500   {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

1. Scan the DB once and find the frequent 1-itemsets (single-item patterns).
2. Sort the frequent items in frequency-descending order to obtain the f-list.
3. Scan the DB again and construct the FP-tree.

Header table (item : frequency): f:4, c:4, a:3, b:3, m:3, p:3
F-list = f-c-a-b-m-p
[Figure: the FP-tree rooted at {}, with the main path f:4 - c:3 - a:3 - m:2 - p:2,
a branch b:1 - m:1 under a:3, a branch b:1 under f:4, and a separate path
c:1 - b:1 - p:1 under the root; header-table links connect the nodes of each item.]
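A minimal Python sketch of steps 1-3 (an illustration only; it builds the tree but omits the header-table node links used by full FP-growth):

from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count items and keep the frequent ones in descending frequency (the f-list)
    counts = Counter(item for t in transactions for item in t)
    f_list = [i for i, c in counts.most_common() if c >= min_support]
    root = FPNode(None)
    # Pass 2: insert each transaction's ordered frequent items as a path, sharing prefixes
    for t in transactions:
        ordered = [i for i in f_list if i in t]
        node = root
        for item in ordered:
            node = node.children.setdefault(item, FPNode(item, node))
            node.count += 1
    return root, f_list

db = [
    {"f", "a", "c", "d", "g", "i", "m", "p"},
    {"a", "b", "c", "f", "l", "m", "o"},
    {"b", "f", "h", "j", "o", "w"},
    {"b", "c", "k", "s", "p"},
    {"a", "f", "c", "e", "l", "p", "m", "n"},
]
root, f_list = build_fp_tree(db, min_support=3)
print(f_list)  # the frequent items f, c, a, b, m, p (ties may be ordered differently)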
FP-Growth Example
• Construct the conditional pattern base, which consists of the set of
prefix paths in the FP-tree co-occurring with the suffix pattern, and
then construct its conditional FP-tree.

  Item   Conditional pattern base     Conditional FP-tree
  p      {(fcam:2), (cb:1)}           {(c:3)}|p
  m      {(fca:2), (fcab:1)}          {(f:3, c:3, a:3)}|m
  b      {(fca:1), (f:1), (c:1)}      --
  a      {(fc:3)}                     {(f:3, c:3)}|a
  c      {(f:3)}                      {(f:3)}|c
  f      --                           --

Which of the rules below fulfill a confidence level of at least 80%?

  Rule       Support   Confidence
  c → p      3         75%
  fca → m    3         100%
  ca → m     3         100%
  f → a      3         75%
  c → a      3         75%
  f → c      3         75%

Results: generate the association rules.
Exercise
• The data below is a hypothetical dataset of
transactions, with each letter representing an item. Let
min_support = 3 and min_confidence = 80%.

  TID   Items bought
  1     {E, K, M, N, O, Y}
  2     {D, E, K, N, O, Y}
  3     {A, E, K, M}
  4     {C, K, M, U, Y}
  5     {C, E, I, K, O}

• From the conditional frequent-pattern tree, generate the
frequent pattern rules.
Benefits of the FP-tree Structure
• Completeness
– Preserve complete information for frequent pattern
mining
– Never break a long pattern of any transaction
• Compactness
– Reduce irrelevant info—infrequent items are gone
– Items in frequency descending order: the more
frequently occurring, the more likely to be shared
– Never larger than the original database (not counting
node-links and the count field)
Project (Due: ________)
• Requirement - what you need to do for this project is:
− Choose a dataset with 10+ attributes and at least 1000 instances.
As much as possible, try to use local data to make the analysis easy;
otherwise go to the URL: http://www.kdnuggets.com/datasets/
− Preprocess the dataset if there are any incomplete data, missing
values, outliers, or unbalanced classes.
− Choose at least two classification, clustering, or association rule
discovery algorithms that are implemented in Weka.
− Run the chosen algorithms on the selected and prepared dataset.
• Project Report - write a publishable report with the following
sections:
− Introduction (the problem, objective & methodology of the
study)
− Review of related work
− Data preparation
− Experimental setup (mining method & parameters used for the
experiment)
− Summary of experimental results & findings of the study
− Concluding remarks
− References
