
Data Mining

Association Rule Mining

Adapted from
Data Mining: Concepts and Techniques by
Han, Kamber & Pei

1
Outline

Basic Concepts

Frequent Itemset Mining using Apriori Algorithm

Which Patterns Are Interesting?

Demonstration of frequent itemset mining and generating association rules using Python

2
What Is Association Rule Mining?

Imagine that you are a sales manager at the All Electronics store and you are talking to a customer who recently bought a PC and a digital camera from the store.
What should you recommend to her next?
Information about which products are frequently purchased together by your customers can be very useful for making such recommendations.
Frequent patterns and association rules are the knowledge that you want to mine in such a scenario.
3
What Is Association Rule Mining?

Association rule mining is a machine learning technique used to find associations/relations between items/products/variables in large transactional data sets.
This helps us understand customers' buying habits.
To find associations between items, we first need to find frequent itemsets in the transactional data.

4
What Is Frequent Pattern Analysis?

Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that appears frequently in a data set.
e.g. a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset.
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining.
Frequent itemset mining (finding frequent patterns) leads to the discovery of associations and correlations among items in large transactional data sets.
A typical example of frequent itemset mining is market basket analysis: the process of analyzing customer buying habits by finding associations between the different items that customers place in their shopping baskets.
Frequent patterns are presented in the form of association rules.
5
Market Basket Analysis

6
Applications

The discovery of interesting correlations among huge amounts of transaction data helps in business decision-making processes such as catalog design, cross-marketing, and analysis of customer shopping behavior.
Products that are frequently purchased together can be bundled, and a discount can be offered to increase sales.
Design of store layout:
Strategy 1: items that are purchased together can be placed in proximity.
Strategy 2: place them at opposite ends of the store, so that customers who purchase such items pick up other items along the way.
7
Basic Concepts: Frequent Patterns

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

An itemset is a set of items; an itemset that contains k items is called a k-itemset, X = {x1, …, xk}.
Absolute support, or support count, of X: the frequency of occurrence of the itemset X. e.g. the support count of {Beer, Diaper} is 3.
Relative support, or support, s: the fraction of transactions that contain X (i.e., the probability that a transaction contains X). e.g. the support of {Beer, Diaper} is 3/5.
An itemset X is frequent if X's support is not less than a minsup threshold.

(Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both.)

8
Basic Concepts: Association Rules

Frequent patterns are represented in the form of rules.
Support and confidence are the two measures of rule interestingness.
Association rules are written as: X => Y [support, confidence].
support, s: the probability that a transaction contains X ∪ Y.
confidence, c: the conditional probability that a transaction containing X also contains Y.
e.g. Diaper => Beer [support = 60%, confidence = 75%]
Support is the percentage of transactions that contain both X and Y (Diaper and Beer). A support value of 60% means that 60% of all the transactions under analysis show that beer and diapers are purchased together.
Confidence is the percentage of transactions containing X that also contain Y. A confidence value of 75% means that 75% of the customers who purchased diapers also bought beer.
9
Basic Concepts: Association Rules

Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Example association rules:
Beer => Diaper (support = 60%, confidence = 100%)
Diaper => Beer (support = 60%, confidence = 75%)

support(Beer => Diaper) = no. of transactions containing both {Beer, Diaper} / total no. of transactions = 3/5 = 60%
confidence(Beer => Diaper) = no. of transactions containing both {Beer, Diaper} / no. of transactions containing Beer = 3/3 = 100%

Support signifies how popular an itemset is; confidence signifies the likelihood of item Y being purchased when item X is purchased.
10
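To make the definitions concrete, here is a small self-contained Python sketch (not from the slides) that computes the support and confidence of a rule X => Y over the five-transaction data set above; the function name support_confidence is illustrative.

def support_confidence(transactions, X, Y):
    """Support and confidence of the rule X => Y over a list of transactions."""
    X, Y = set(X), set(Y)
    n_both = sum(1 for t in transactions if X | Y <= set(t))  # transactions with X and Y
    n_x = sum(1 for t in transactions if X <= set(t))         # transactions with X
    return n_both / len(transactions), n_both / n_x

tdb = [
    {'Beer', 'Nuts', 'Diaper'},
    {'Beer', 'Coffee', 'Diaper'},
    {'Beer', 'Diaper', 'Eggs'},
    {'Nuts', 'Eggs', 'Milk'},
    {'Nuts', 'Coffee', 'Diaper', 'Eggs', 'Milk'},
]
print(support_confidence(tdb, {'Beer'}, {'Diaper'}))   # (0.6, 1.0)
print(support_confidence(tdb, {'Diaper'}, {'Beer'}))   # (0.6, 0.75)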
Basic Concepts: Association Rules

Association rule mining can be viewed as a two-step process:
1. Find all frequent itemsets (k-itemsets that are frequently purchased together).
2. Generate strong rules from the frequent itemsets.

Apriori is a seminal algorithm proposed by Agrawal and Srikant in 1994 for mining frequent itemsets.

11
Apriori: A Candidate Generation & Test Approach

Apriori pruning principle: if there is any itemset that is infrequent, its supersets should not be generated/tested!
The Apriori algorithm employs an iterative approach in which frequent k-itemsets are used to generate candidate (k+1)-itemsets.
Method:
Initially, scan the DB once to get the frequent 1-itemsets.
Generate the frequent 2-itemsets, then the frequent 3-itemsets, and so on.
Terminate when no frequent candidate set can be generated.
12
The Apriori Algorithm—An Example (min_sup = 2)

Database TDB:
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan, C1 (candidate 1-itemsets with support counts):
{A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1 (frequent 1-itemsets; {D} is pruned, sup < 2):
{A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (candidates from L1 join L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan, counts:
{A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
L2: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3: {B, C, E}
3rd scan, L3: {B, C, E}: 2
13
Apriori Algorithm

C3 = L2 join L2 = {{I1,I2,I3}, {I1,I2,I5}, {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5}, {I2,I4,I5}}

Nonempty subsets of {I1,I3,I5}: {I1}, {I3}, {I5}, {I1,I3}, {I1,I5}, {I3,I5}

An itemset can be frequent only if it satisfies the minimum support threshold and all of its nonempty subsets are also frequent; candidates in C3 that have an infrequent subset are therefore pruned before the database scan.
The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1
        that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
15
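The pseudocode above translates almost line for line into Python. The following is a minimal sketch (not from the slides): min_support here is an absolute count, and the union-based join is a simplification that stands in for the textbook's prefix-based self-join.

from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) with its support count."""
    # 1st scan: count single items to get the frequent 1-itemsets L1
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    L = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(L)
    k = 1
    while L:
        # Candidate generation: join Lk with itself, then prune any
        # candidate that has an infrequent k-subset (Apriori property)
        prev = set(L)
        candidates = {
            a | b
            for a in prev for b in prev
            if len(a | b) == k + 1
            and all(frozenset(s) in prev for s in combinations(a | b, k))
        }
        # Scan the database once, counting the surviving candidates
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            items = set(t)
            for c in candidates:
                if c <= items:
                    counts[c] += 1
        L = {s: c for s, c in counts.items() if c >= min_support}
        result.update(L)
        k += 1
    return result

# The TDB from the example slide, min_sup = 2
tdb = [{'A', 'C', 'D'}, {'B', 'C', 'E'}, {'A', 'B', 'C', 'E'}, {'B', 'E'}]
for itemset, count in sorted(apriori(tdb, 2).items(), key=lambda kv: len(kv[0])):
    print(set(itemset), count)   # ends with {'B', 'C', 'E'} 2

Running this on the example database reproduces the L1, L2, and L3 tables from the previous slides.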
Implementation of Apriori

How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
Example of candidate generation (a Python sketch follows this slide):
L3 = {abc, abd, acd, ace, bcd}
Self-joining: L3 * L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4 = {abcd}
16
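Below is a compact sketch of the candidate-generation step on its own, reproducing the L3 example above. As in the previous sketch, the union-based join is an assumed simplification of the prefix-based self-join, but the pruning step yields the same C4.

from itertools import combinations

def gen_candidates(Lk, k):
    """Self-join Lk, then prune candidates that have an infrequent k-subset."""
    joined = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
    return {c for c in joined
            if all(frozenset(s) in Lk for s in combinations(c, k))}

L3 = {frozenset(s) for s in ('abc', 'abd', 'acd', 'ace', 'bcd')}
print(gen_candidates(L3, 3))  # {frozenset({'a','b','c','d'})}; acde is pruned, ade not in L3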
Generating Association Rules

For each frequent itemset l, generate all nonempty subsets of l.
For every nonempty subset s of l, generate the rule s => (l - s) if
support_count(l) / support_count(s) >= min_conf threshold.
Because the rules are generated from frequent itemsets, each one automatically satisfies the minimum support threshold.
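A minimal sketch of this procedure (not from the slides): given one frequent itemset and a table of support counts for it and all of its subsets, it emits every rule s => (l - s) whose confidence meets min_conf. The support counts below are illustrative values assumed for the {I1, I2, I5} example on the next slide.

from itertools import combinations

def gen_rules(l, support_count, min_conf):
    """Generate the strong rules s => (l - s) from one frequent itemset l."""
    l = frozenset(l)
    rules = []
    for r in range(1, len(l)):                       # every nonempty proper subset s
        for s in map(frozenset, combinations(l, r)):
            conf = support_count[l] / support_count[s]
            if conf >= min_conf:
                rules.append((set(s), set(l - s), conf))
    return rules

# Illustrative support counts for {I1, I2, I5} and its subsets (assumed data)
counts = {frozenset(k): v for k, v in [
    (('I1',), 6), (('I2',), 7), (('I5',), 2),
    (('I1', 'I2'), 4), (('I1', 'I5'), 2), (('I2', 'I5'), 2),
    (('I1', 'I2', 'I5'), 2)]}
for antecedent, consequent, conf in gen_rules(('I1', 'I2', 'I5'), counts, 0.7):
    print(antecedent, '=>', consequent, f'(confidence = {conf:.0%})')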
Generating Association Rules: Example

Consider the frequent itemset X = {I1, I2, I5}.
Step 1: the nonempty subsets of X are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
Step 2: generate the candidate rules
{I1, I2} => {I5}, {I1, I5} => {I2}, {I2, I5} => {I1},
{I1} => {I2, I5}, {I2} => {I1, I5}, {I5} => {I1, I2}
and keep those whose confidence meets the min_conf threshold.
Misleading rules

Consider 10,000 transactions, of which 6,000 include computer games, 7,500 include videos, and 4,000 include both. For the rule computer games => videos:
Support = 4000/10000 = 40%
Confidence = 4000/6000 = 66%
Misleading rules cont …

The rule is misleading because the probability of purchasing videos is 75%, which is greater than the confidence of the rule (66%).
Computer games and videos are negatively correlated: the purchase of one of these items actually decreases the likelihood of purchasing the other.
Use another measure: lift.
Lift

Lift is a measure of correlation between items:
lift(A => B) = support(A ∪ B) / (support(A) × support(B)) = confidence(A => B) / support(B)
For the example above: lift = 0.40 / (0.60 × 0.75) ≈ 0.89 < 1, confirming that computer games and videos are negatively correlated.
Lift cont…

Lift assesses the degree to which the occurrence of one item (e.g. A) "lifts" the occurrence of the other (e.g. B):
If lift(A => B) = 1, the occurrence of A is independent of the occurrence of B; there is no association between the items.
If lift(A => B) < 1, the occurrence of A is negatively correlated with the occurrence of B, i.e. the occurrence of A decreases the chances of the occurrence of B by the factor lift(A => B).
If lift(A => B) > 1, the occurrence of A is positively correlated with the occurrence of B, i.e. the occurrence of A increases the chances of the occurrence of B by the factor lift(A => B).
Leverage:
leverage(X => Y) = support(X ∪ Y) - support(X) × support(Y)
Leverage measures the difference between the observed and the expected joint probability of X and Y.
A leverage value of 0 indicates that the occurrences of X and Y are independent of each other.

Conviction:
conviction(X => Y) = (1 - supp(Y)) / (1 - conf(X => Y)) = P(X)P(not Y) / P(X and not Y)
Conviction compares the probability that X appears without Y if they were independent with the actual frequency of the appearance of X without Y.
A conviction value of 1 indicates that the occurrences of X and Y are independent of each other.
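The following sketch evaluates the three measures on the games/videos example from the earlier slides (support(X) = 0.60, support(Y) = 0.75, support(X ∪ Y) = 0.40):

def lift(s_xy, s_x, s_y):
    return s_xy / (s_x * s_y)

def leverage(s_xy, s_x, s_y):
    return s_xy - s_x * s_y

def conviction(s_xy, s_x, s_y):
    conf = s_xy / s_x
    return float('inf') if conf == 1 else (1 - s_y) / (1 - conf)

# X = computer games, Y = videos
print(round(lift(0.40, 0.60, 0.75), 2))        # 0.89 (< 1: negative correlation)
print(round(leverage(0.40, 0.60, 0.75), 2))    # -0.05 (below independence)
print(round(conviction(0.40, 0.60, 0.75), 2))  # 0.75 (< 1: X occurs without Y more often than expected)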

23
In most cases, it is sufficient to focus on a combination of support, confidence, and either lift or leverage to quantitatively measure the "quality" of a rule. However, the real value of a rule, in terms of usefulness and actionability, is subjective and depends heavily on the particular domain and business objectives.

24
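The outline promises a demonstration of frequent itemset mining and rule generation in Python. Below is a minimal end-to-end sketch using the open-source mlxtend library on the five-transaction data set from the earlier slides; the choice of mlxtend is an assumption, since the slides do not name a library.

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [
    ['Beer', 'Nuts', 'Diaper'],
    ['Beer', 'Coffee', 'Diaper'],
    ['Beer', 'Diaper', 'Eggs'],
    ['Nuts', 'Eggs', 'Milk'],
    ['Nuts', 'Coffee', 'Diaper', 'Eggs', 'Milk'],
]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets with support >= 60%, then rules with confidence >= 75%
frequent = apriori(onehot, min_support=0.6, use_colnames=True)
rules = association_rules(frequent, metric='confidence', min_threshold=0.75)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

With these thresholds the output contains the two rules from slide 10: Beer => Diaper (confidence 100%) and Diaper => Beer (confidence 75%).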
