5 DM Association Discovery
Pattern Discovery: Definition
• Pattern discovery attempts to discover hidden linkages
between data items.
• Given a set of records, each of which contains some
number of items from a given collection:
– Pattern discovery produces dependency rules that predict the
occurrence of an item based on the occurrences of other items.
• Motivation of pattern discovery: Finding inherent
regularities in data.
− What products were often purchased together?
o Pasta & Tea?
− What are the subsequent purchases after buying a PC?
− What kinds of DNA are sensitive to the new drug D?
− Can we find redundant tests in medicine?
Pattern Discovery: Application
• Shelf management (e.g., supermarkets,
pharmacies, bookshops, etc.)
− Goal: To identify items that are bought together by
sufficiently many customers.
− Approach: Process the collected sales transaction data to
find dependencies among items.
− A classic rule: if a customer buys Coffee and Milk, then
(s)he is very likely to buy Tea. So, don’t be surprised if
you find Tea stacked next to the Coffee!
{Coffee, Milk} → Tea
Prevalent vs. Interesting Rules
• Analysts already know about prevalent rules
– Interesting rules are those that deviate from
prior expectation
• Mining’s payoff is in finding interesting
(surprising) phenomena
• What makes a rule surprising?
– Does not match prior expectation
• Correlation between milk and cereal
remains roughly constant over time
– Cannot be trivially derived from simpler rules
– Milk 10%, cereal 10%
– Milk & cereal 10% … prevailing
– Eggs 10%
– Milk, cereal & eggs 0.1% … surprising!
[Cartoon: 1995: “Milk and eggs sell together!”; 1998: “Milk and cereal sell together!” (analyst: Zzzz…)]
Pattern Discovery: Basic concepts
• itemset: A set of one or more items
• k-itemset: X = {x1, …, xk}
• Support, s: the fraction of transactions that contain X
(i.e., the probability that a transaction contains X)
– For a rule X → Y, support is the fraction of transactions that
contain both X and Y; the rule is kept only if this support
exceeds a user-defined threshold s
– An itemset X is frequent if X’s support is no less than a
minsup threshold
• Confidence: the probability of finding Y in a transaction
containing all of X1, X2, …, Xn
– Confidence, c, is the conditional probability that a transaction
having X also contains Y; the rule is kept only if the conditional
probability of Y given X exceeds a user threshold c
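To make these two measures concrete, here is a minimal Python sketch (the five baskets and the function names are made up for illustration) that computes support and confidence straight from their definitions:

```python
# Hypothetical transaction database: each basket is a set of items.
baskets = [
    {"coffee", "milk", "tea"},
    {"coffee", "milk"},
    {"milk", "bread"},
    {"coffee", "milk", "tea", "sugar"},
    {"bread", "tea"},
]

def support(itemset, baskets):
    """Fraction of baskets containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(X, Y, baskets):
    """Conditional probability of Y given X: support(X and Y) / support(X)."""
    return support(set(X) | set(Y), baskets) / support(X, baskets)

# {coffee, milk} occurs in 3 of 5 baskets -> support 0.6;
# 2 of those 3 baskets also contain tea   -> confidence 2/3.
print(support({"coffee", "milk"}, baskets))              # 0.6
print(confidence({"coffee", "milk"}, {"tea"}, baskets))  # 0.666...
```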
Steps in Pattern Discovery
• Pattern discovery finds itemsets that appear “frequently” in the baskets.
• The problem of pattern discovery can be broken down into
two steps:
1. Finding frequent patterns from large set of items
− Frequent pattern: a pattern (itemsets, subsequences,
substructures, etc.) that occurs frequently in a dataset.
− An itemset is said to be a frequent itemset if its items appear
frequently together in a transaction dataset.
o For example, milk and bread may occur together frequently in a
single transaction and hence form a frequent itemset.
− A subsequence refers to items that occur in transactions in a
sequential order.
o For example, buying a computer at time t0 may be followed by buying
a digital camera at time t1 and a memory card at time t2.
Steps in Pattern Discovery …
− A subsequence that appears frequently is said to be a
frequent subsequence.
− A substructure refers to different structural forms of the
dataset, such as sub-graphs, sub-trees, or sub-lattices,
which may be combined with itemsets or subsequences.
− If a substructure occurs frequently, it is called a (frequent)
structured pattern.
− Finding such frequent patterns plays an essential role in
mining associations, correlations, classification, clustering,
and other data mining tasks as well.
− Thus, frequent pattern mining has become an important
data mining task and a focused theme in data mining
research.
− This chapter is dedicated to methods of frequent itemset
mining.
Steps in Pattern Discovery …
2. Generating association rules from these itemsets.
• Association rules are statements of the form
{X1, X2, …, Xn} → Y, meaning that Y is likely to be present in a
transaction if X1, X2, …, Xn are all in the transaction.
• Example: discovered rules can be:
{Milk} → {Coke}
{Tea, Milk} → {Coke}
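As a sketch of this rule-generation step (Python; the baskets, the chosen frequent itemset, and the min_conf value are all made up), the snippet below takes one frequent itemset F and emits every rule X → F−X whose confidence clears the threshold:

```python
from itertools import combinations

baskets = [{"Milk","Coke"}, {"Tea","Milk","Coke"}, {"Milk","Coke","Bread"},
           {"Tea","Milk"}, {"Milk","Coke"}]
min_conf = 0.8

def count(itemset):
    """Number of baskets containing every item of `itemset`."""
    return sum(itemset <= b for b in baskets)

F = frozenset({"Milk", "Coke"})     # a frequent itemset found in step 1
for r in range(1, len(F)):
    for X in map(frozenset, combinations(F, r)):
        conf = count(F) / count(X)  # conf(X -> F-X) = supp(F) / supp(X)
        if conf >= min_conf:
            print(set(X), "->", set(F - X), f"(conf = {conf:.0%})")
# {'Milk'} -> {'Coke'} (conf = 80%)
# {'Coke'} -> {'Milk'} (conf = 100%)
```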
Example: Finding frequent itemsets
• Given a support threshold S, itemsets that appear in at
least S baskets are called frequent itemsets.
• Example: Frequent Itemsets
– Items bought: {milk, coke, pepsi, biscuit, juice}.
– Support threshold S = 4 baskets.
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
– Frequent itemsets: {m}, {c}, {b}, {j}, {m,b}, {b,c}.
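A brute-force Python sketch that reproduces this result by enumerating every possible itemset; this is only workable on toy data, which is exactly why the smarter algorithms later in the chapter exist:

```python
from itertools import combinations

baskets = [
    {"m", "c", "b"}, {"m", "p", "j"},
    {"m", "b"},      {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"},
    {"c", "b", "j"}, {"b", "c"},
]
S = 4   # support threshold, in number of baskets

items = sorted(set().union(*baskets))
# Try every candidate itemset of every size: exponential in the number
# of items in general, hence the need for pruning at scale.
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        n = sum(set(cand) <= b for b in baskets)
        if n >= S:
            print(set(cand), n)
# {'b'} 6, {'c'} 5, {'j'} 4, {'m'} 5, {'b','c'} 4, {'b','m'} 4
```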
Association Rules
• Find all rules on itemsets of the form X→Y with minimum
support and confidence.
– If-then rules about the contents of baskets:
• {i1, i2,…,ik} → j means: “if a basket contains all of i1,…,ik then it
is likely to contain j.”
• A typical question: “find all association rules with support ≥ s
and confidence ≥ c.” Note: “support” of an association rule is the
support of the set of items it mentions.
– The confidence of this association rule is the probability of j given
i1, …, ik: the fraction of the baskets containing all of i1, …, ik that
also contain j.
– Example: Confidence
B1 = {m, c, b} B2 = {m, p, j} B3 = {m, b}
B4 = {c, j} B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
• An association rule: {m, b} → c (with confidence = 2/4 = 50%).
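Using the same eight baskets, this confidence figure can be checked directly; a minimal sketch:

```python
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

n_X  = sum({"m", "b"} <= b for b in baskets)        # baskets with m and b: 4
n_XY = sum({"m", "b", "c"} <= b for b in baskets)   # ... that also have c: 2
print(f"confidence = {n_XY}/{n_X} = {n_XY / n_X:.0%}")   # 2/4 = 50%
```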
Frequent Itemset Mining Methods
• The downward closure property of frequent
patterns:
− Any subset of a frequent itemset must be frequent
− If {Coke, Tea, nuts} is frequent, so is {Coke, Tea}
− i.e., every transaction having {Coke, Tea, nuts} also
contains {Coke, Tea}
• The hardest problem often turns out to be finding
the frequent pairs.
Frequent Itemset Mining Methods
• Apriori: A Candidate Generation-and-Test Approach
– A two-pass approach, called Apriori, limits the need for
main memory.
– Key idea: if a set of items appears at least s times, so does
every subset.
• Contra-positive for pairs: if item i does not appear in s baskets,
then no pair including i can appear in s baskets.
• FPGrowth: A Frequent Pattern-Growth Approach
– Mines frequent patterns without candidate generation
– Still uses the Apriori pruning principle to discard infrequent
items
– Scans the DB only twice!
• Once to find the frequent 1-itemsets (single item patterns)
• Once to construct the FP-tree, the data structure of FPGrowth
• Vertical Data Format
Frequent Itemset Mining Methods …
• Both the Apriori and FP-growth methods mine
frequent patterns from a set of transactions in TID-
itemset format (i.e., {TID: itemset}), where TID is a
transaction ID and itemset is the set of items bought in
transaction TID. This is known as the horizontal data
format.

TID   Itemset
1     {Biscuits, Bread, Cheese, Yogurt, Sugar}
2     {Bread, Cheese, Coffee, Sugar}
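The vertical data format, by contrast, stores the data as {item: TID-set} pairs. A small Python sketch of the conversion, using the two transactions above:

```python
from collections import defaultdict

# Horizontal format: {TID: itemset}, as in the table above.
horizontal = {
    1: {"Biscuits", "Bread", "Cheese", "Yogurt", "Sugar"},
    2: {"Bread", "Cheese", "Coffee", "Sugar"},
}

# Vertical format: {item: set of TIDs whose transaction contains it}.
vertical = defaultdict(set)
for tid, itemset in horizontal.items():
    for item in itemset:
        vertical[item].add(tid)

# The support count of an itemset is then just the size of the
# intersection of its items' TID-sets, with no further DB scans.
print(vertical["Bread"] & vertical["Cheese"])        # {1, 2}
print(len(vertical["Bread"] & vertical["Cheese"]))   # support count = 2
```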
Apriori: A Candidate Generation & Test Approach
• Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested.
• Method:
– Initially, scan DB once to get frequent 1-itemset
– Generate length-k candidate itemsets from the length-(k−1)
frequent itemsets. For each k, we construct two sets of
k-tuples:
• Ck = candidate k-tuples: those that might be frequent sets
(support ≥ s) based on information from the pass for k−1.
• Lk = the set of truly frequent k-tuples.
– Test the candidates against DB
– Terminate when no frequent or candidate set can be
generated.
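A compact sketch of this method (Python; the apriori function name and the toy baskets are my own): pass 1 builds L1, and each later pass joins Lk−1 with itself to form Ck, prunes any candidate with an infrequent (k−1)-subset, and counts the survivors against the DB:

```python
from itertools import combinations

def apriori(baskets, min_count):
    """Return {itemset: count} for all itemsets in >= min_count baskets."""
    # Pass 1: C1 = all items; L1 = the frequent single items.
    counts = {}
    for b in baskets:
        for item in b:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(L)

    k = 2
    while L:
        # Generate Ck by joining L(k-1) with itself ...
        prev = list(L)
        Ck = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                # ... keeping only real k-sets whose every (k-1)-subset
                # is frequent (the Apriori pruning principle).
                if len(union) == k and all(
                        frozenset(sub) in L
                        for sub in combinations(union, k - 1)):
                    Ck.add(union)
        # Test the candidates against the DB: Lk = frequent members of Ck.
        L = {cand: n for cand in Ck
             if (n := sum(cand <= b for b in baskets)) >= min_count}
        frequent.update(L)
        k += 1
    return frequent

baskets = [{"m","c","b"}, {"m","p","j"}, {"m","b"}, {"c","j"},
           {"m","p","b"}, {"m","c","b","j"}, {"c","b","j"}, {"b","c"}]
print(apriori(baskets, min_count=4))   # the same six itemsets as before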
A-Priori for All Frequent Itemsets
• C1 = all items; L1 = those items counted as frequent on the
first pass; C2 = pairs both of whose items are in L1; in general,
Ck = k-tuples each of whose (k−1)-subsets are in Lk−1, and
Lk = the members of Ck with support ≥ s.
Bottlenecks of the Apriori approach
• The Apriori algorithm reduces the size of candidate frequent
itemsets by using “Apriori property” - all nonempty subsets of
a frequent itemset must also be frequent.
• However, it still requires two nontrivial computationally
expensive processes.
• It requires as many database scans as the size of the largest
frequent itemset: to find frequent k-itemsets, the Apriori
algorithm needs to scan the database k times.
• Breadth-first (i.e., level-wise) search
– Candidate generation, followed by testing the true frequency of
the candidate itemsets against the database.
– It may generate a huge number of candidate sets that will be
discarded later in the test stage.
Pattern-Growth Approach
• The FPGrowth Approach
– Depth-first search: searches depth-wise, extending a given
single item or pair of items into longer and longer combinations.
– Avoids explicit candidate generation; instead, it grows frequent
itemsets directly.
• Major philosophy: Grow long patterns from short ones using
local frequent items only.
– “abc” is a frequent pattern
– Get all transactions having “abc”, i.e., project DB on abc:
DB|abc
– “d” is a local frequent item in DB|abc → abcd is a frequent
pattern
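The grow-long-patterns-from-short-ones idea can be sketched without the FP-tree itself by recursing on projected databases directly. This is a simplified stand-in for FP-Growth (the same divide-and-conquer idea, but on plain transaction lists rather than a compressed tree), with made-up toy baskets:

```python
def pattern_growth(db, min_count, suffix=frozenset(), out=None):
    """Grow frequent patterns recursively via projected databases."""
    if out is None:
        out = {}
    # Count local item frequencies in the (projected) database.
    counts = {}
    for t in db:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    # Fix an order over the locally frequent items so that each
    # pattern is generated exactly once.
    freq = sorted(item for item, c in counts.items() if c >= min_count)
    for i, item in enumerate(freq):
        pattern = suffix | {item}
        out[pattern] = counts[item]
        # DB|item: transactions containing `item`, restricted to the
        # locally frequent items that come after it in the order.
        later = set(freq[i + 1:])
        projected = [t & later for t in db if item in t]
        pattern_growth(projected, min_count, pattern, out)
    return out

baskets = [{"a","b","c"}, {"a","b","c","d"}, {"a","c","d"}, {"b","d"}]
print(pattern_growth(baskets, min_count=2))
# e.g. {a}:3, {a,b}:2, {a,b,c}:2, {a,c}:3, {a,c,d}:2, {a,d}:2, ...
```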
Construct FP-tree from a Transaction Database
Assume min_support = 3 and min_confidence = 80%
TID    Items bought                   (Ordered) frequent items
100    {f, a, c, d, g, i, m, p}       {f, c, a, m, p}
200    {a, b, c, f, l, m, o}          {f, c, a, b, m}
300    {b, f, h, j, o, w}             {f, b}
400    {b, c, k, s, p}                {c, b, p}
500    {a, f, c, e, l, p, m, n}       {f, c, a, m, p}

1. Scan the DB once, find the frequent 1-itemsets (single item patterns)
2. Sort the frequent items in frequency-descending order: the f-list
   F-list = f-c-a-b-m-p
3. Scan the DB again, construct the FP-tree

Header Table (each head pointer links to that item’s nodes in the tree):
Item   frequency
f      4
c      4
a      3
b      3
m      3
p      3

FP-tree (node:count; transactions sharing a prefix share a path):
{}
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1
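A Python sketch of steps 1–3 for the five transactions above; the Node class and the header table kept as plain lists of nodes are simplifications of this sketch (a real FP-tree threads node-links through the tree):

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}                 # item -> child Node

transactions = [
    ["f","a","c","d","g","i","m","p"],
    ["a","b","c","f","l","m","o"],
    ["b","f","h","j","o","w"],
    ["b","c","k","s","p"],
    ["a","f","c","e","l","p","m","n"],
]
min_support = 3

# Step 1: scan the DB once and find the frequent single items.
counts = Counter(item for t in transactions for item in t)
freq = {item: c for item, c in counts.items() if c >= min_support}

# Step 2: the f-list -- frequent items in frequency-descending order
# (ties among a, b, m, p may come out in a different order than the
# slide's f-c-a-b-m-p, which does not affect correctness).
flist = sorted(freq, key=lambda item: -freq[item])

# Step 3: scan the DB again; insert each transaction's frequent items,
# ordered by the f-list, into the tree so common prefixes are shared.
root = Node(None, None)
header = {item: [] for item in flist}      # item -> its nodes in the tree
for t in transactions:
    node = root
    for item in [i for i in flist if i in t]:
        if item not in node.children:
            node.children[item] = Node(item, node)
            header[item].append(node.children[item])
        node = node.children[item]
        node.count += 1

def prefix_paths(item):
    """Prefix paths of `item`: its conditional pattern base."""
    for node in header[item]:
        path, p = [], node.parent
        while p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            yield list(reversed(path)), node.count

print(list(prefix_paths("p")))   # e.g. [(['f','c','a','m'], 2), (['c','b'], 1)]
```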
FP-Growth Example
• Construct the conditional pattern base, which consists of the set of
prefix paths in the FP-tree co-occurring with the suffix pattern, and
then construct its conditional FP-tree.
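Worked through for the FP-tree on the previous slide (min_support = 3), the conditional pattern bases and the frequent patterns they yield are:
− p: base {(f c a m : 2), (c b : 1)} → conditional FP-tree ⟨c:3⟩ → pattern cp:3
− m: base {(f c a : 2), (f c a b : 1)} → conditional FP-tree ⟨f:3, c:3, a:3⟩ → patterns fm, cm, am, fcm, fam, cam, fcam (all :3)
− b: base {(f c a : 1), (f : 1), (c : 1)} → no item reaches min_support, so no longer pattern ends in b
− a: base {(f c : 3)} → conditional FP-tree ⟨f:3, c:3⟩ → patterns fa, ca, fca (all :3)
− c: base {(f : 3)} → conditional FP-tree ⟨f:3⟩ → pattern fc:3
− f: empty conditional pattern base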
Exercise
• The data below is a hypothetical dataset of transactions,
with each letter representing an item. Let min_support = 3
and min_confidence = 80%.
Project (Due: ________)
• Requirement: what you need to do for this project is:
− Choose a dataset with 10+ attributes and at least 1000 instances.
As much as possible, try to use local data to make the analysis
easy; otherwise go to the URL: http://www.kdnuggets.com/datasets/
− Preprocess the dataset if there are any incomplete data, missing
values, outliers, or unbalanced classes.
− Choose at least two algorithms for classification, clustering, or
association rule discovery that are implemented in Weka.
− Run the chosen algorithms on the selected and prepared dataset.
• Project Report - Write a publishable report with the following
sections:
− Introduction (the problem, objective & methodology of the
study)
− Review of related works
− Data preparation
− Experimental setup (mining method & parameters used for the
experiment)
− Summary of experimental result & findings of the study
− Concluding remarks
− References