
Association Rule Mining

Dr. Pankaj Agarwal, Professor,


Department of Computer Science & Engineering,
The NorthCap University
What is Association Rule Mining?

▪ Association rule mining finds interesting associations and relationships among large sets of data items.
▪ These rules show how frequently an itemset occurs in a transaction.
▪ Given a set of transactions, association rule mining aims to find the rules which enable us to predict the occurrence of a specific item based on the occurrences of the other items in the transaction.
▪ It is employed in market basket analysis, web usage mining, continuous production, etc.
What is an Association Rule?
Association rules are applied to large databases with hundreds of items and several thousands of transactions.

transaction ID   milk   bread   butter   beer   diapers
1                1      1       0        0      0
2                0      0       1        0      0
3                0      0       0        1      1
4                1      1       1        0      0
5                0      1       0        0      0

• Set of items: I = {I1, I2, I3, ..., In}
• Set of transactions: T = {T1, T2, ..., Tn}
• Each transaction has a unique id and contains a subset of the items.
• A rule is defined as an implication of the form X ==> Y, where X and Y are itemsets. For example, with I = {milk, bread, butter, beer, diapers}, X = {I1, I2} and Y = {I5}; a rule could be {butter, bread} ==> {milk}.
• X is called the antecedent or left-hand side (LHS) and Y the consequent or right-hand side (RHS); the rule above means that if butter and bread are bought, then milk is also bought.
How to select association rules?
▪ Items that occur more frequently in transactions are more important than others;
▪ Rules based on frequently occurring itemsets have better predictive power.
▪ Support and Confidence are two measures, based on the frequency of itemsets, that are used to build association rules.
• Support: Support is an indication of how frequently the itemset appears in the database.

  Total number of transactions = 5
  Number of times itemset {beer, diapers} appears = 1
  Support of {beer, diapers} = 1/5 = 0.2

▪ Confidence: Confidence is an indication of how often the rule has been found to be true. It tells us how likely item B is to be purchased when item A is purchased, for a rule expressed as {A -> B}.

  For the rule {butter, bread} ==> {milk}:
  supp(X∪Y) = support of {butter, bread, milk} = 1/5 = 0.2
  supp(X) = support of {butter, bread} = 1/5 = 0.2
  confidence of the rule = 0.2/0.2 = 1
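As a small illustration (not part of the original slides), the sketch below computes these two measures directly from the transaction table above; the helper names (`support`, `confidence`) are illustrative.

```python
# Minimal sketch (illustrative): computing support and confidence over the
# toy basket data from the table above.
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """conf(X -> Y) = supp(X ∪ Y) / supp(X)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"beer", "diapers"}, transactions))               # 0.2
print(confidence({"butter", "bread"}, {"milk"}, transactions))  # 1.0
```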
Conviction
• The conviction of a rule compares the expected frequency of the antecedent occurring without the consequent (if they were independent) with the observed frequency of incorrect predictions. It is commonly defined as:
  conv(X ==> Y) = (1 - supp(Y)) / (1 - conf(X ==> Y))
How to select association rules?
Lift: Lift is the ratio of the confidence of the rule to the support of the consequent. It tells us how likely item B is to be purchased when item A is purchased, while controlling for how popular item B is.

lift(I1→I2) = Confidence(I1→I2) / Support(I2)

For the rule {milk, bread} ==> {butter}:

supp(X∪Y) = support of {milk, bread, butter} = 1/5 = 0.2
supp(X) = support of {milk, bread} = 2/5 = 0.4
supp(Y) = support of {butter} = 2/5 = 0.4
lift of the rule = 0.2 / (0.4 × 0.4) = 1.25

▪ If lift = 1, it would imply that the probability of occurrence of the antecedent and that of the consequent are
independent of each other.

▪ When two events are independent of each other, no rule can be drawn involving those two events.
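A matching sketch for lift (again illustrative, not from the slides), repeating the small helper so the snippet stays self-contained:

```python
# Sketch (illustrative): lift(X -> Y) = supp(X ∪ Y) / (supp(X) * supp(Y)),
# computed on the same toy transactions used above.
transactions = [
    {"milk", "bread"},
    {"butter"},
    {"beer", "diapers"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(lhs, rhs, transactions):
    joint = support(set(lhs) | set(rhs), transactions)
    return joint / (support(lhs, transactions) * support(rhs, transactions))

print(lift({"milk", "bread"}, {"butter"}, transactions))  # 0.2 / (0.4 * 0.4) = 1.25
```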
The main applications of association rule mining

▪ Market Analysis: For example, if you analyse grocery lists of a consumer over a period of time you will be
able to see a certain buying pattern, like, if peanut butter & jelly are bought then bread is also bought; this
information can be used in marketing and pricing decisions.
▪ Medical Diagnosis: Association rules in medical diagnosis can be useful for assisting physicians in diagnosing and treating patients. Diagnosis is not an easy process and has scope for errors, which may result in unreliable end results. Using relational association rule mining, we can identify the probability of the occurrence of an illness with respect to various factors and symptoms.
▪ Census Data: Every government has tonnes of census data. This data can be used to plan efficient public
services(education, health, transport) as well as help public businesses (for setting up new factories,
shopping malls, and even marketing particular products). This application of association rule mining and data
mining has immense potential in supporting sound public policy and bringing forth an efficient functioning of
a democratic society.
▪ Recommendation Systems: Platforms like Amazon and Netflix use them. For example, Netflix movie recommendations are made based on choices made by previous customers: if a movie of a particular genre is selected, then similar movie recommendations are made. This type of if/then relationship is defined by rules using frequency measures like Support and Confidence.
Apriori Algorithm for Association Rule Mining

▪ The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules.
▪ The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties.
▪ It applies an iterative approach, or level-wise search, where frequent k-itemsets are used to find frequent (k+1)-itemsets.
▪ To improve the efficiency of level-wise generation of frequent itemsets, an important property is used
called Apriori property which helps by reducing the search space.
▪ Apriori Property:
➢ All subsets of a frequent itemset must be frequent(Apriori property).
➢ If an itemset is infrequent, all its supersets will be infrequent
▪ It is designed to work on the databases that contain transactions.
▪ It is mainly used for market basket analysis and helps to understand the products that can be bought together.
• Say, a transaction containing {wine, chips, bread} also contains {wine, bread}.
• So, according to the principle of Apriori, if {wine, chips, bread} is frequent, then {wine, bread} must also be
frequent.
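To make the level-wise search and the Apriori pruning concrete, here is a compact, illustrative sketch (the function name and the small transaction list are assumptions, not from the slides):

```python
from itertools import combinations

# Sketch of the level-wise (Apriori) search. Names and data are illustrative.
def apriori(transactions, min_support_count):
    transactions = [frozenset(t) for t in transactions]
    # C1: every single item is a candidate of size 1.
    items = {i for t in transactions for i in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while current:
        # Count candidates and keep those meeting the support threshold (Lk).
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(level)
        # Join step: union pairs of frequent k-itemsets into (k+1)-itemsets,
        # then prune any candidate with an infrequent k-subset (Apriori property).
        k += 1
        candidates = set()
        for a in level:
            for b in level:
                u = a | b
                if len(u) == k and all(frozenset(s) in level for s in combinations(u, k - 1)):
                    candidates.add(u)
        current = candidates
    return frequent

ts = [{"wine", "chips", "bread"}, {"wine", "bread"}, {"chips", "bread"}, {"wine", "bread"}]
print(apriori(ts, min_support_count=2))
```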
Working example of Apriori Algorithm
Consider the following dataset; we will find the frequent itemsets and generate association rules for them.

Minimum support count = 2
Minimum confidence = 60%

Step-1: K=1
(1) Create a table containing the support count of each item present in the dataset, called C1 (candidate set).
(2) Compare each candidate item's support count with the minimum support count (here min_support = 2). If the support count of a candidate item is less than min_support, remove it. This gives us itemset L1.
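The dataset table itself is an image on the slide and is not reproduced in this text; the sketch below assumes the classic nine-transaction dataset that matches the support counts used later in this walkthrough, and shows the C1 → L1 step:

```python
from collections import Counter

# Assumption: the slide's dataset is the classic nine-transaction example whose
# support counts match the later calculations (sup(I1)=6, sup(I2)=7, ...).
dataset = {
    "T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"}, "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"}, "T500": {"I1", "I3"}, "T600": {"I2", "I3"},
    "T700": {"I1", "I3"},       "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}
min_support = 2

# C1: support count of every single item.
c1 = Counter(item for items in dataset.values() for item in items)
# L1: keep only items whose count meets the minimum support.
l1 = {item: count for item, count in c1.items() if count >= min_support}
print(l1)  # I1:6, I2:7, I3:6, I4:2, I5:2 (key order may vary)
```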
Working example of Apriori Algorithm
Step-2: K=2
▪ Generate candidate set C2 using L1 (this is called the join step). The condition for joining is that the itemsets should have (K-2) elements in common.
▪ Check whether all subsets of each candidate itemset are frequent; if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)
▪ Now find the support count of these itemsets by searching the dataset.
▪ Compare the candidate (C2) support counts with the minimum support count (here min_support = 2). If the support count of a candidate itemset is less than min_support, remove it. This gives us itemset L2.
Working example of Apriori Algorithm
Step-3: K=3
▪ Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 and Lk-1 is that they should have (K-2) elements in common, so here, for L2, the first element should match. The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5} and {I2, I3, I5}.
▪ Check whether all subsets of these itemsets are frequent; if not, remove that itemset.
▪ Find the support count of the remaining itemsets by searching the dataset.

C3 (itemsets generated by joining L2):
Itemset       Support count
I1, I2, I3    2
I1, I2, I5    2
I1, I3, I5    1
I2, I3, I4    0
I2, I4, I5    0
I2, I3, I5    1

Compare the candidate (C3) support counts with the minimum support count (min_support = 2); if a candidate's support count is less than min_support, remove it. This gives us itemset L3:
Itemset       Support count
I1, I2, I3    2
I1, I2, I5    2
Working example of Apriori Algorithm
Step-4: K=4
▪ Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 and Lk-1 (K=4) is that they should have (K-2) elements in common, so here, for L3, the first two items should match.
▪ Check whether all subsets of these itemsets are frequent. Here the only itemset formed by joining L3 ({I1, I2, I3}: 2 and {I1, I2, I5}: 2) is {I1, I2, I3, I5}, and its subsets include {I1, I3, I5}, which is not frequent. So there is no itemset in C4.
▪ We stop here because no further frequent itemsets are found.

▪ Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that we need to calculate the confidence of each rule.
▪ Confidence: a confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.

  Confidence(A->B) = Support_count(A∪B) / Support_count(A)
Working example of Apriori Algorithm

So here, by taking any frequent itemset as an example, let us understand the rule generation.
Consider the itemset {I1, I2, I3} from L3. The rules that can be generated are:
▪ [I1,I2] => [I3]  // confidence = sup(I1,I2,I3)/sup(I1,I2) = 2/4 = 50%
▪ [I1,I3] => [I2]  // confidence = sup(I1,I2,I3)/sup(I1,I3) = 2/4 = 50%
▪ [I2,I3] => [I1]  // confidence = sup(I1,I2,I3)/sup(I2,I3) = 2/4 = 50%
▪ [I1] => [I2,I3]  // confidence = sup(I1,I2,I3)/sup(I1) = 2/6 ≈ 33%
▪ [I2] => [I1,I3]  // confidence = sup(I1,I2,I3)/sup(I2) = 2/7 ≈ 28%
▪ [I3] => [I1,I2]  // confidence = sup(I1,I2,I3)/sup(I3) = 2/6 ≈ 33%
Since the minimum confidence is 60%, none of these rules qualifies as a strong rule for this itemset.
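A short illustrative sketch of this rule-generation step, using the support counts from the example above (the helper name is an assumption):

```python
from itertools import combinations

# Sketch (illustrative): generating rules from a frequent itemset by splitting it
# into antecedent/consequent pairs and keeping rules that meet min confidence.
support_counts = {  # support counts taken from the worked example above
    frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I3"]): 6,
    frozenset(["I1", "I2"]): 4, frozenset(["I1", "I3"]): 4, frozenset(["I2", "I3"]): 4,
    frozenset(["I1", "I2", "I3"]): 2,
}

def rules_from_itemset(itemset, min_confidence):
    itemset = frozenset(itemset)
    for r in range(1, len(itemset)):
        for lhs in combinations(itemset, r):
            lhs = frozenset(lhs)
            conf = support_counts[itemset] / support_counts[lhs]
            if conf >= min_confidence:
                yield (set(lhs), set(itemset - lhs), conf)

# With minimum confidence 60%, no rule from {I1, I2, I3} qualifies (max is 50%).
print(list(rules_from_itemset({"I1", "I2", "I3"}, min_confidence=0.6)))  # []
```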
Working example of Apriori Algorithm
• Find the frequent itemsets in the given table, with a minimum support of 2 and confidence of 50%.

TID     List of items
T2000   A, B, C
T1000   A, C
T4000   A, D
T5000   B, E, F

Step 1: Scan D and count each candidate. The candidate list is {A, B, C, D, E, F}.

C1:
Item   Support count
{A}    3
{B}    2
{C}    2
{D}    1
{E}    1
{F}    1

Step 2: Compare the candidate support counts with the minimum support count of 2.

L1:
Item   Support count
{A}    3
{B}    2
{C}    2

Step 3: Generate candidate C2 from L1.

C2: {A, B}, {A, C}, {B, C}

Step 4: Scan D and count each candidate in C2 to find its support.

C2:
Item     Support count
{A, B}   1
{A, C}   2
{B, C}   1

Step 5: Compare the candidate (C2) support counts with the minimum support count.

L2:
Item     Support count
{A, C}   2

Step 6: The data contains the frequent itemset {A, C}, so the association rules that can be generated from L2 are shown below with their support and confidence.

Association rule   Support   Confidence    Confidence %
A -> C             2         2/3 = 0.66    66%
C -> A             2         2/2 = 1       100%

So the final rules are:
Rule 1: A -> C
Rule 2: C -> A
Limitations of Apriori Algorithm

▪ The Apriori algorithm can be slow: it becomes very slow and inefficient when memory capacity is limited and the number of transactions is large.
▪ The main limitation is the time and memory required to hold a vast number of candidate sets.
▪ To detect a frequent pattern of size 100, it would have to generate on the order of 2^100 candidate itemsets, which makes candidate generation costly and time-consuming.
▪ It also has to check many candidate itemsets and scan the database repeatedly to count their support.
FP-Growth Algorithm

• This algorithm is an improvement to the Apriori method.


• A frequent pattern is generated without the need for candidate generation.
• FP growth algorithm represents the database in the form of a tree called a frequent pattern tree or FP tree.
• Apriori algorithm generates all itemsets by scanning the full transactional database.
• Whereas the FP growth algorithm only generates the frequent itemsets according to the minimum support defined
by the user.
• Since Apriori scans the whole database multiple times, it is more resource-hungry, and the time to generate the association rules increases exponentially with the increase in database size.
• On the other hand, the FP growth algorithm doesn’t scan the whole database multiple times and the scanning time
increases linearly.
• Hence, the FP growth algorithm is much faster than the Apriori algorithm.
• This tree structure will maintain the association between the itemsets.
• The database is fragmented using one frequent item.
• This fragmented part is called “pattern fragment”. The itemsets of these fragmented patterns are analyzed. Thus with
this method, the search for frequent itemsets is reduced comparatively.
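As a usage illustration (not from the slides), and assuming a recent version of the mlxtend library is available, both algorithms can be run on a small one-hot-encoded dataset and the resulting frequent itemsets turned into rules:

```python
# Assumption: mlxtend is installed; the transactions below are illustrative.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["beer", "diapers"],
]

# One-hot encode the transaction list into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Both calls return the same frequent itemsets; FP-Growth avoids candidate generation.
frequent_ap = apriori(onehot, min_support=0.5, use_colnames=True)
frequent_fp = fpgrowth(onehot, min_support=0.5, use_colnames=True)

# Derive rules that meet a minimum confidence threshold.
rules = association_rules(frequent_fp, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```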
FP-Growth Algorithm

FP Tree
• Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of the
database.
• The purpose of the FP tree is to mine the most frequent pattern.
• Each node of the FP tree represents an item of the itemset.
• The root node represents null while the lower nodes represent the itemsets.
• The associations of the nodes with the lower nodes, that is, of the itemsets with the other itemsets, are maintained while forming the tree.
Frequent Pattern Algorithm Steps

#1) The first step is to scan the database to find the occurrences of the itemsets in the database. This step is the
same as the first step of Apriori. The count of 1-itemsets in the database is called support count or frequency of
1-itemset.
#2) The second step is to construct the FP tree. For this, create the root of the tree. The root is represented by
null.
#3) The next step is to scan the database again and examine the transactions. Examine the first transaction and
find out the itemset in it. The itemset with the max count is taken at the top, the next itemset with lower count
and so on. It means that the branch of the tree is constructed with transaction itemsets in descending order of
count.
#4) The next transaction in the database is examined. Its itemsets are ordered in descending order of count. If any itemset of this transaction is already present in another branch (for example, from the 1st transaction), then this transaction's branch shares a common prefix starting from the root.
This means that the common itemset is linked to the new node of another itemset in this transaction.
Frequent Pattern Algorithm Steps

#5) Also, the count of an itemset is incremented each time it occurs in a transaction: the count of a shared (common) node is increased by 1, and a newly created node is linked into the tree with a count of 1, according to the transactions.
#6) The next step is to mine the created FP tree. For this, the lowest node is examined first, along with the links of the lowest nodes. The lowest node represents a frequent pattern of length 1. From there, traverse the path in the FP tree. This path, or these paths, are called the conditional pattern base.
The conditional pattern base is a sub-database consisting of the prefix paths in the FP tree that occur with the lowest node (the suffix).
#7) Construct a Conditional FP Tree, which is formed by a count of itemsets in the path. The itemsets
meeting the threshold support are considered in the Conditional FP Tree.
#8) Frequent Patterns are generated from the Conditional FP Tree.
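A minimal, illustrative sketch of the tree-building part (steps #2–#5); the class and function names are assumptions, and the mining steps #6–#8 are not shown:

```python
# Minimal sketch of FP-tree construction (steps #2-#5). Names are illustrative.
class FPNode:
    def __init__(self, item, parent):
        self.item = item          # None for the root
        self.count = 0
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fp_tree(transactions, min_support_count):
    # Pass 1: count items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= min_support_count}

    # Pass 2: insert each transaction, items sorted by descending support.
    root = FPNode(None, None)
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            if item not in node.children:          # start a new branch
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1                        # shared prefixes accumulate counts
    return root

tree = build_fp_tree([{"I1","I2","I3"}, {"I2","I3","I4"}, {"I4","I5"},
                      {"I1","I2","I4"}, {"I1","I2","I3","I5"}, {"I1","I2","I3","I4"}],
                     min_support_count=3)
print(tree.children["I2"].count)  # 5, matching the walkthrough below
```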
Example of FP-Growth Algorithm

Transaction   List of items
T1            I1, I2, I3
T2            I2, I3, I4
T3            I4, I5
T4            I1, I2, I4
T5            I1, I2, I3, I5
T6            I1, I2, I3, I4

• Support threshold = 50%, Confidence = 60%
• Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3

1. Count of each item:
Item   Count
I1     4
I2     5
I3     4
I4     4
I5     2

2. Sort the items in descending order of count (I5 is dropped, since its count is below min_sup):
Item   Count
I2     5
I1     4
I3     4
I4     4
Frequent Pattern Algorithm Steps
3. Build FP Tree
1.Considering the root node null.
2.The first scan of Transaction T1: I1, I2, I3 contains three items {I1:1}, {I2:1}, {I3:1}, where I2 is linked as a child to
root, I1 is linked to I2 and I3 is linked to I1.
3.T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to root, I3 is linked to I2 and I4 is linked to I3. But this
branch would share I2 node as common as it is already used in T1.
4.Increment the count of I2 by 1 and I3 is linked as a child to I2, I4 is linked as a child to I3. The count is {I2:2},
{I3:1}, {I4:1}.
5. T3: I4, I5. Similarly, a new branch is created in which I5 is linked to I4 as a child.
6.T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the root node, hence it will be incremented
by 1. Similarly I1 will be incremented by 1 as it is already linked with I2 in T1, thus {I2:3}, {I1:2}, {I4:1}.
7.T5:I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3}, {I3:2}, {I5:1}.
8. T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.
Frequent Pattern Algorithm Steps
Consider another small transaction set. The original transactions, and the same transactions after removing infrequent items (support count < 2) and sorting the remaining items by support, are:

TID   Items          Pruned and sorted
1     B, D, C, A     D, A, B, C
2     E, D, C        D, C          (E removed: support < 2)
3     A, B           A, B
4     A, C, D        D, A, C
5     F, G, D, B     D, B          (F, G removed)

Item supports:
Item   Support
D      4
A      3
B      3
C      3
E      1   (removed)
F      1   (removed)
G      1   (removed)

The branch for the 1st transaction (D, A, B, C) is inserted under the root:
NULL -> D:1 -> A:1 -> B:1 -> C:1
FP-Growth Algorithm

Growing the tree transaction by transaction:
• After T1 (D, A, B, C): NULL -> D:1 -> A:1 -> B:1 -> C:1
• After T2 (D, C): the D node is shared, so its count becomes 2 and a new child C:1 is added directly under D. The tree is NULL -> D:2, with D's children A:1 -> B:1 -> C:1 and C:1.
• After T3 (A, B): A is not yet a child of the root, so a new branch NULL -> A:1 -> B:1 is created.
FP-Growth Algorithm

• After T4 (D, A, C): the prefix D -> A is shared, so D becomes 3 and A becomes 2, and a new child C:1 is added under A.
• After T5 (D, B): D becomes 4 and a new child B:1 is added under D.
The completed tree after all five transactions (root = NULL, children indented):
NULL
  D:4
    A:2
      B:1
        C:1
      C:1
    C:1
    B:1
  A:1
    B:1

Mining the tree by suffix (minimum support count = 2):

Ends with item: C
  Paths (conditional pattern base): DAB:1, DA:1, D:1
  Count of each item in the paths: D:3, A:2, B:1, C:3
  Candidate itemsets with counts: DC:3, AC:2, DAC:2
  Frequent itemsets: DC, AC, DAC, C

Ends with item: B
  Paths (conditional pattern base): DA:1, D:1, A:1
  Count of each item in the paths: D:2, A:2, B:3
  Candidate itemsets with counts: DB:2, DAB:1
  Frequent itemsets: DB, B
FP-Growth Algorithm: Another Example

Transaction ID   List of items in the transaction
T1               B, A, T
T2               A, C
T3               A, S
T4               B, A, C
T5               B, S
T6               A, S
T7               B, S
T8               B, A, S, T
T9               B, A, S

Item            Support count
Asparagus (A)   7
Beans (B)       6
Squash (S)      6
Corn (C)        2
Tomatoes (T)    2
FP-Growth Algorithm: Another Example (continued)

[Figures on these slides: step-by-step construction of the FP tree for these transactions, and compressing of the conditional DB.]
FP-Growth Algorithm: Another Example

Item           Conditional pattern base       Conditional FP tree    Frequent patterns generated
Tomatoes (T)   {{A,B:1}, {A,B,S:1}}           <A:2, B:2>             {A,T:2}, {B,T:2}, {A,B,T:2}
Corn (C)       {{A,B:1}, {A:1}}               <A:2>                  {A,C:2}
Squash (S)     {{A,B:2}, {A:2}, {B:2}}        <A:4, B:2>, <B:2>      {A,S:4}, {B,S:4}, {A,B,S:2}
Beans (B)      {{A:4}}                        <A:4>                  {A,B:4}
Advantages of the FP-Growth Algorithm
1. The algorithm needs to scan the database only twice, compared to Apriori, which scans the transactions for each iteration.
2. The pairing of items is not done in this algorithm, which makes it faster.
3. The database is stored in a compact version in memory.
4. It is efficient and scalable for mining both long and short frequent patterns.

Disadvantages of the FP-Growth Algorithm
1. The FP tree is more cumbersome and difficult to build than Apriori's candidate sets.
2. It may be expensive.
3. When the database is large, the algorithm may not fit in shared memory.
FP Growth vs Apriori

Pattern generation
• FP Growth generates patterns by constructing an FP tree.
• Apriori generates patterns by pairing the items into singletons, pairs and triplets.

Candidate generation
• FP Growth: there is no candidate generation.
• Apriori uses candidate generation.

Process
• FP Growth: the process is faster; the runtime increases linearly with the number of itemsets.
• Apriori: the process is comparatively slower; the runtime increases exponentially with the number of itemsets.

Memory usage
• FP Growth: a compact version of the database is saved.
• Apriori: the candidate combinations are saved in memory.
ECLAT algorithm

• The ECLAT algorithm stands for Equivalence Class Clustering and bottom-up
Lattice Traversal. It is one of the popular methods of Association Rule mining.
• It is a more efficient and scalable version of the Apriori algorithm.
• While the Apriori algorithm works in a horizontal sense imitating the Breadth-
First Search of a graph,
• the ECLAT algorithm works in a vertical manner just like the Depth-First Search
of a graph.
• This vertical approach of the ECLAT algorithm makes it a faster algorithm than
the Apriori algorithm.
• It uses a bit vector (or tidset) representation of the transactions. Candidates are organized in a prefix tree that keeps a lexicographic ordering, and support is computed by intersecting the transaction sets of items.
Example (from the figure on the slide): ten transactions
{A,D,E}, {B,C,D}, {A,C,E}, {A,C,D,E}, {A,E}, {A,C,D}, {B,C}, {A,C,D,E}, {C,B,E}, {A,D,E}

Each item is represented by a bit vector over the ten transactions (1 = present):
A: 1011110101 (support 7)
B: 0100001010 (support 3)
C: 0111011110 (support 7)
D: 1101010101 (support 6)
E: 1011100111 (support 7)

The prefix tree is expanded depth-first, and the support of each extension is obtained by AND-ing bit vectors:
under A: B:0, C:4, D:5, E:6; under B: C:3, D:1, E:1; under C: D:4, E:4; under D: E:4;
deeper: AC→D:3, AC→E:3, AD→E:4, ACD→E:2; and so on, until no extension meets the minimum support.
• How does the algorithm work?
The basic idea is to use Transaction Id Set (tidset) intersections to compute the support value of a candidate, while avoiding the generation of subsets that do not exist in the prefix tree.

▪ In the first call of the function, all single items are used along with their tidsets.

▪ Then the function is called recursively and in each recursive call, each item-tidset pair is verified
and combined with other item-tidset pairs.

▪ This process is continued until no candidate item-tidset pairs can be combined.
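An illustrative sketch of this recursive tidset-intersection search (function and variable names are assumptions); it uses the item tidsets from the worked example later in this section:

```python
# Sketch (illustrative): ECLAT-style depth-first search where the support of a
# candidate is the size of the intersection of its items' tidsets.
def eclat(items, min_support, results):
    """items: list of (itemset, tidset) pairs forming one equivalence class."""
    while items:
        itemset, tidset = items.pop(0)
        results[frozenset(itemset)] = len(tidset)
        # Extend the current itemset with every remaining item in the class by
        # intersecting tidsets; recurse on the resulting equivalence class.
        suffix = [(itemset | other, tidset & other_tids)
                  for other, other_tids in items
                  if len(tidset & other_tids) >= min_support]
        eclat(suffix, min_support, results)
    return results

# Item tidsets from the worked example below (all singletons meet min support = 2).
tidsets = {
    "Bread":  {"T1", "T4", "T5", "T7", "T8", "T9"},
    "Butter": {"T1", "T2", "T3", "T4", "T6", "T8", "T9"},
    "Milk":   {"T3", "T5", "T6", "T7", "T8", "T9"},
    "Coke":   {"T2", "T4"},
    "Jam":    {"T1", "T8"},
}
start = [(frozenset([item]), tids) for item, tids in tidsets.items()]
frequent = eclat(start, min_support=2, results={})
print(frequent[frozenset({"Bread", "Butter"})])  # 4
```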


Advantages over Apriori algorithm:-
1. Memory Requirements: Since the ECLAT algorithm uses a Depth-First
Search approach, it uses less memory than Apriori algorithm.
2. Speed: The ECLAT algorithm is typically faster than the Apriori
algorithm.
3. Number of Computations: The ECLAT algorithm does not involve the
repeated scanning of the data to compute the individual support
values.
k = 1, minimum support = 2

Item     Tidset
Bread    {T1, T4, T5, T7, T8, T9}
Butter   {T1, T2, T3, T4, T6, T8, T9}
Milk     {T3, T5, T6, T7, T8, T9}
Coke     {T2, T4}
Jam      {T1, T8}

We now recursively call the function until no more item-tidset pairs can be combined:

k = 2
Item              Tidset
{Bread, Butter}   {T1, T4, T8, T9}
{Bread, Milk}     {T5, T7, T8, T9}
{Bread, Coke}     {T4}
{Bread, Jam}      {T1, T8}
{Butter, Milk}    {T3, T6, T8, T9}
{Butter, Coke}    {T2, T4}
{Butter, Jam}     {T1, T8}
{Milk, Jam}       {T8}

k = 3
Item                    Tidset
{Bread, Butter, Milk}   {T8, T9}
{Bread, Butter, Jam}    {T1, T8}

k = 4
Item                         Tidset
{Bread, Butter, Milk, Jam}   {T8}

We stop at k = 4 because there are no more item-tidset pairs to combine.
Since minimum support = 2, we conclude the following rules from the given dataset:-

Items bought       Recommended products
Bread              Butter
Bread              Milk
Bread              Jam
Butter             Milk
Butter             Jam
Bread and Butter   Milk
Bread and Butter   Jam

Thank You

For any queries, reach at:


[email protected]