Association Rule Mining Presentation
X ==> Y
• Where X and Y are itemsets drawn from a set of items, for example I = {milk, bread, butter, beer, diapers},
X = {I1, I2} and Y = {I5}. A rule could be
{butter, bread} ==> {milk}
• X is called the antecedent or left-hand side (LHS) and Y the consequent or right-hand side (RHS); the rule above means that if butter and bread are bought, then milk is also bought.
How to select association rules?
▪ Items that occur more frequently in transactions are more important than others;
▪ rules based on frequently occurring itemsets have better predictive power.
▪ Support and Confidence are two measures based on frequency of itemsets that are used to build association
rules.
• Support: an indication of how frequently the itemset appears in the database.
▪ Confidence: an indication of how often the rule has been found to be true. For a rule {A -> B}, it tells us how likely item B is to be
purchased when item A is purchased.
lift(I1 → I2) = Confidence(I1 → I2) / Support(I2)
▪ If lift = 1, it implies that the occurrence of the antecedent and that of the consequent are independent of each other.
▪ When two events are independent of each other, no rule can be drawn involving those two events. (A small computational sketch of these three measures follows below.)
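To make these measures concrete, here is a minimal Python sketch, assuming a small made-up transaction list (the items and transactions are illustrative, not taken from any real dataset), that computes support, confidence and lift for the rule {butter, bread} ==> {milk}:

# Illustrative sketch: support, confidence and lift for one rule.
# The transactions below are made-up example data.
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "diapers"},
    {"bread", "butter", "beer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs): how often the rule lhs -> rhs holds."""
    return support(set(lhs) | set(rhs)) / support(lhs)

def lift(lhs, rhs):
    """confidence(lhs -> rhs) / support(rhs); a value of 1 means independence."""
    return confidence(lhs, rhs) / support(rhs)

print(support({"butter", "bread"}))               # 0.6
print(confidence({"butter", "bread"}, {"milk"}))  # ~0.33
print(lift({"butter", "bread"}, {"milk"}))        # ~0.56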
The main applications of association rule mining
▪ Market Analysis: For example, if you analyse grocery lists of a consumer over a period of time you will be
able to see a certain buying pattern, like, if peanut butter & jelly are bought then bread is also bought; this
information can be used in marketing and pricing decisions.
▪ Medical Diagnosis: Association rules can assist physicians in diagnosing and treating patients. Diagnosis is not an
easy process and is prone to errors that can lead to unreliable results. Using relational association rule mining, we
can identify the probability of the occurrence of an illness in relation to various factors and symptoms.
▪ Census Data: Every government holds large amounts of census data. This data can be used to plan efficient public
services (education, health, transport) as well as to help businesses (for setting up new factories, shopping malls,
and even marketing particular products). This application of association rule mining and data mining has immense
potential in supporting sound public policy and the efficient functioning of a democratic society.
• Recommendation Systems: services such as Amazon and Netflix use association rules. For example, Netflix movie
recommendations are made based on choices made by previous customers: if a movie of a particular genre is
selected, then similar movies are recommended. This type of if/then relationship is defined by rules using
frequency measures such as Support and Confidence.
Apriori Algorithm for Association Rule Mining
▪ The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset
for Boolean association rules.
▪ The algorithm is named Apriori because it uses prior knowledge of frequent-itemset properties.
▪ It applies an iterative, level-wise search in which frequent k-itemsets are used to find frequent (k+1)-itemsets; a Python sketch of this search is given after the example below.
▪ To improve the efficiency of level-wise generation of frequent itemsets, an important property is used
called Apriori property which helps by reducing the search space.
▪ Apriori Property:
➢ All subsets of a frequent itemset must be frequent (Apriori property).
➢ If an itemset is infrequent, all its supersets will be infrequent.
▪ It is designed to work on the databases that contain transactions.
▪ It is mainly used for market basket analysis and helps to understand the products that can be bought together.
• Say, a transaction containing {wine, chips, bread} also contains {wine, bread}.
• So, according to the principle of Apriori, if {wine, chips, bread} is frequent, then {wine, bread} must also be
frequent.
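The level-wise search and the Apriori pruning property described above can be sketched in Python as follows. This is a simplified illustration, not a full implementation; the transaction data and min_support value in the usage example are assumed for demonstration.

from itertools import combinations

# Minimal Apriori sketch: level-wise generation of frequent itemsets,
# using the Apriori property to prune candidates with infrequent subsets.
def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def count(itemset):
        return sum(itemset <= t for t in transactions)

    # L1: frequent 1-itemsets
    current = {frozenset([i]) for i in items if count(frozenset([i])) >= min_support}
    frequent, k = {}, 1
    while current:
        frequent.update({s: count(s) for s in current})
        # Join step: combine frequent k-itemsets into (k+1)-candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # Prune step (Apriori property): drop candidates with an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        current = {c for c in candidates if count(c) >= min_support}
        k += 1
    return frequent

# Usage with made-up transactions and min_support = 2
db = [{"wine", "chips", "bread"}, {"wine", "bread"},
      {"chips", "bread"}, {"wine", "bread"}]
for itemset, cnt in sorted(apriori(db, 2).items(), key=lambda kv: len(kv[0])):
    print(set(itemset), cnt)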
Working example of Apriori Algorithm
Consider the following dataset; we will find the frequent itemsets and generate
association rules for them.
Step-1: K=1
(1) Create a table containing the support count of each item present in the dataset,
called C1 (candidate set).

At K = 3, the candidate set C3 is generated by joining L2 with itself:

Itemset         Support count
I1, I2, I3      2
I1, I2, I5      2
I1, I3, I5      1
I2, I3, I4      0
I2, I4, I5      0
I2, I3, I5      1

Compare each candidate's support count in C3 with the minimum support count (here min_support = 2);
if the support count of a candidate itemset is less than min_support, remove it. This gives us itemset L3:

Itemset         Support count
I1, I2, I3      2
I1, I2, I5      2
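This pruning step can be expressed directly in Python; the candidate counts below are the ones from the table above:

# Candidate 3-itemsets (C3) and their support counts, taken from the table above
C3 = {("I1", "I2", "I3"): 2, ("I1", "I2", "I5"): 2, ("I1", "I3", "I5"): 1,
      ("I2", "I3", "I4"): 0, ("I2", "I4", "I5"): 0, ("I2", "I3", "I5"): 1}
min_support = 2
# Keep only the candidates whose support count meets min_support -> L3
L3 = {itemset: cnt for itemset, cnt in C3.items() if cnt >= min_support}
print(L3)   # {('I1', 'I2', 'I3'): 2, ('I1', 'I2', 'I5'): 2}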
Working example of Apriori Algorithm
Step-4:
▪ Generate candidate set C4 using L3 (join step). Condition of joining Lk-1 and Lk-1 (K=4)
is that, they should have (K-2) elements in common. So here, for L3, first 2 elements
(items) should match.
▪ Check whether all subsets of these itemsets are frequent (here the itemset formed by
joining L3 is {I1, I2, I3, I5}, and one of its subsets, {I1, I3, I5}, is not frequent).
So there is no itemset in C4.
▪ We stop here because no further frequent itemsets are found.
L3 (from the previous step):
Itemset         Support count
I1, I2, I3      2
I1, I2, I5      2
Candidate generated by joining L3: {I1, I2, I3, I5}
▪ Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the
picture. For that, we need to calculate the confidence of each rule.
▪ Confidence:
Confidence(A -> B) = Support_count(A ∪ B) / Support_count(A)
For example, for the rule {milk, bread} -> {butter}, a confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.
Working example of Apriori Algorithm
So here, by taking an example of any frequent itemset, let us understand the rule generation.
Consider Itemset {I1, I2, I3} //from L3
So rules can be
▪ [I1,I2]=>[I3] //confidence = sup(I1,I2,I3)/sup(I1,I2) = 2/4*100=50%
▪ [I1,I3]=>[I2] //confidence = sup(I1,I2,I3)/sup(I1,I3) = 2/4*100=50%
▪ [I2,I3]=>[I1] //confidence = sup(I1,I2,I3)/sup(I2,I3) = 2/4*100=50%
▪ [I1]=>[I2,I3] //confidence = sup(I1,I2,I3)/sup(I1) = 2/6*100=33%
▪ [I2]=>[I1,I3] //confidence = sup(I1,I2,I3)/sup(I2) = 2/7*100=28%
▪ [I3]=>[I1,I2] //confidence = sup(I1,I2,I3)/sup(I3) = 2/6*100=33%
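These confidence values can be reproduced with a short sketch. The support counts below are the ones used in this worked example, and the 50% confidence threshold is assumed for illustration:

from itertools import combinations

# Support counts from the worked example above
support_count = {
    frozenset(["I1", "I2", "I3"]): 2,
    frozenset(["I1", "I2"]): 4, frozenset(["I1", "I3"]): 4, frozenset(["I2", "I3"]): 4,
    frozenset(["I1"]): 6, frozenset(["I2"]): 7, frozenset(["I3"]): 6,
}

itemset, min_confidence = frozenset(["I1", "I2", "I3"]), 0.5
# Every non-empty proper subset of the frequent itemset can serve as an antecedent
for r in range(1, len(itemset)):
    for lhs in combinations(sorted(itemset), r):
        lhs = frozenset(lhs)
        rhs = itemset - lhs
        conf = support_count[itemset] / support_count[lhs]
        label = "strong" if conf >= min_confidence else "weak"
        print(f"{sorted(lhs)} => {sorted(rhs)}: confidence = {conf:.0%} ({label})")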
Working example of Apriori Algorithm
• Find the frequent item sets in the given table, with minimum support of 2 and confidence 50%.
Step 1: Scan D and count each candidate. The candidate list is {A, B, C, D, E, F}.

Transaction database D:
TID       List of items
T2000     A, B, C
T1000     A, C
T4000     A, D
T5000     B, E, F

C1:
Items     Support count
{A}       3
{B}       2
{C}       2
{D}       1
{E}       1
{F}       1

Compare the candidate support counts with min_support = 2; this gives L1:
Items     Support count
{A}       3
{B}       2
{C}       2

C2 (generated by joining L1 with itself):
Items
{A, B}
{A, C}
{B, C}
Step 4: Scan D for count of each candidate in C2 and find the support
▪ The Apriori algorithm can be slow. It becomes very slow and inefficient when memory capacity is limited and the
number of transactions is large.
▪ The main limitation is the time and memory required to generate and hold a vast number of candidate sets.
▪ To detect a frequent pattern of size 100, it would have to generate on the order of 2^100 candidate itemsets,
making candidate generation costly and time-consuming.
▪ It also has to check many candidate itemsets and scan the database repeatedly to count their support.
FP-Growth Algorithm
FP Tree
• Frequent Pattern Tree is a tree-like structure that is made with the initial itemsets of the
database.
• The purpose of the FP tree is to mine the most frequent pattern.
• Each node of the FP tree represents an item of the itemset.
• The root node represents null while the lower nodes represent the itemsets.
• The association of the nodes with the lower nodes, that is, of the itemsets with the other
itemsets, is maintained while forming the tree.
Frequent Pattern Algorithm Steps
#1) The first step is to scan the database to find the occurrences of the itemsets in the database. This step is the
same as the first step of Apriori. The count of 1-itemsets in the database is called support count or frequency of
1-itemset.
#2) The second step is to construct the FP tree. For this, create the root of the tree. The root is represented by
null.
#3) The next step is to scan the database again and examine the transactions. Examine the first transaction and
find the items in it. The item with the maximum count is placed at the top, followed by the item with the next
lower count, and so on. This means that a branch of the tree is constructed from the transaction's items in
descending order of count.
#4) The next transaction in the database is examined. Its items are again ordered in descending order of count. If
any item of this transaction is already present in another branch (for example from the 1st transaction), then this
transaction's branch shares a common prefix starting from the root.
This means that the common item nodes are reused, and the remaining items of this transaction are linked as new nodes below them.
Frequent Pattern Algorithm Steps
#5) Also, the count of an itemset is incremented each time it occurs in a transaction: the count of each common
node is increased by 1, while new nodes are created with count 1 and linked according to the transaction.
#6) The next step is to mine the created FP Tree. For this, the lowest node is examined first along with
the links of the lowest nodes. The lowest node represents a frequent pattern of length 1. From this node,
traverse the paths in the FP Tree; this path or these paths are called a conditional pattern base.
Conditional pattern base is a sub-database consisting of prefix paths in the FP tree occurring with the
lowest node (suffix).
#7) Construct a Conditional FP Tree, which is formed from the counts of itemsets in the paths. Only the itemsets
meeting the threshold support are kept in the Conditional FP Tree.
#8) Frequent Patterns are generated from the Conditional FP Tree.
Example of FP-Growth Algorithm

Transaction     List of items
T1              I1, I2, I3
T2              I2, I3, I4
T3              I4, I5
T4              I1, I2, I4
T5              I1, I2, I3, I5
T6              I1, I2, I3, I4

• Support threshold = 50%, Confidence = 60%
• Support threshold = 50% => 0.5 * 6 = 3 => min_sup = 3

1. Count of each item:
Item    Count
I1      4
I2      5
I3      4
I4      4
I5      2

2. Sort the items in descending order of count (I5, with count 2 < min_sup, is not frequent):
Item    Count
I2      5
I1      4
I3      4
I4      4
Frequent Pattern Algorithm Steps
3. Build FP Tree
1.Consider the root node, null.
2.The first scan of Transaction T1: I1, I2, I3 contains three items {I1:1}, {I2:1}, {I3:1}, where I2 is linked as a child to
root, I1 is linked to I2 and I3 is linked to I1.
3.T2: I2, I3, I4 contains I2, I3, and I4, where I2 is linked to root, I3 is linked to I2 and I4 is linked to I3. But this
branch would share I2 node as common as it is already used in T1.
4.Increment the count of I2 by 1 and I3 is linked as a child to I2, I4 is linked as a child to I3. The count is {I2:2},
{I3:1}, {I4:1}.
5.T3: I4, I5. Similarly, a new branch with I5 is linked to I4 as a child is created.
6.T4: I1, I2, I4. The sequence will be I2, I1, and I4. I2 is already linked to the root node, hence it will be incremented
by 1. Similarly I1 will be incremented by 1 as it is already linked with I2 in T1, thus {I2:3}, {I1:2}, {I4:1}.
7.T5:I1, I2, I3, I5. The sequence will be I2, I1, I3, and I5. Thus {I2:4}, {I1:3}, {I3:2}, {I5:1}.
8.T6: I1, I2, I3, I4. The sequence will be I2, I1, I3, and I4. Thus {I2:5}, {I1:4}, {I3:3}, {I4:1}.
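A compact Python sketch of the construction steps above, run on the six transactions of this example, is shown below. It is a simplified illustration: the node-links and the mining step are omitted, and (to mirror the walk-through) infrequent items such as I5 are not pruned before insertion, although a full FP-Growth implementation would drop them first.

from collections import Counter

# Minimal FP-tree construction sketch (node-links and the mining step are omitted).
class Node:
    def __init__(self, item, parent=None):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fp_tree(transactions):
    counts = Counter(i for t in transactions for i in t)
    order = sorted(counts, key=lambda i: (-counts[i], i))    # descending frequency
    root = Node(None)
    for t in transactions:
        node = root
        for item in [i for i in order if i in t]:            # insert in frequency order
            child = node.children.setdefault(item, Node(item, node))
            child.count += 1                                  # shared prefixes only bump counts
            node = child
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

transactions = [{"I1", "I2", "I3"}, {"I2", "I3", "I4"}, {"I4", "I5"},
                {"I1", "I2", "I4"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3", "I4"}]
show(build_fp_tree(transactions))   # main branch ends up as I2:5 -> I1:4 -> I3:3, as above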
Frequent Pattern Algorithm Steps

[Figure: FP-tree construction for a second example. The item support counts are D: 4, A: 3, B: 3, C: 3, E: 1, F: 1, G: 1; after sorting by support and removing infrequent items, E, F and G are dropped. The first transaction (D, A, B, C) forms the first branch of the tree, with each of D, A, B and C receiving a count of 1.]
FP-Growth Algorithm

Transaction database for this example:
TID    Items
1      D, A, B, C
2      D, C
3      A, B
4      D, A, C
5      D, B

[Figure: the FP tree is grown transaction by transaction. After T1 the tree is a single branch D:1-A:1-B:1-C:1; T2 increments D to 2 and adds C:1 as a new child of D; T3 adds a separate branch A:1-B:1 under the root; T4 and T5 increment the shared prefixes, giving a final tree with D:4 and A:1 under the root and A:2 under D. The conditional databases are then compressed from this tree.]
FP-Growth Algorithm: Another Example
Item Support Count
Asparagus (A) 7
Beans (B) 6
Squash (S) 6
Corn (C) 2
Tomatoes (T) 2
FP Growth vs Apriori

Pattern Generation
• FP Growth generates patterns by constructing an FP tree.
• Apriori generates patterns by pairing the items into singletons, pairs and triplets.

Candidate Generation
• FP Growth: there is no candidate generation.
• Apriori uses candidate generation.

Process
• FP Growth: the process is faster than Apriori; the runtime increases linearly with the number of itemsets.
• Apriori: the process is comparatively slower; the runtime increases exponentially with the number of itemsets.

Memory Usage
• FP Growth: a compact version of the database is saved.
• Apriori: the candidate combinations are saved in memory.
ECLAT algorithm
• The ECLAT algorithm stands for Equivalence Class Clustering and bottom-up
Lattice Traversal. It is one of the popular methods of Association Rule mining.
• It is a more efficient and scalable version of the Apriori algorithm.
• While the Apriori algorithm works in a horizontal sense imitating the Breadth-
First Search of a graph,
• the ECLAT algorithm works in a vertical manner just like the Depth-First Search
of a graph.
• This vertical approach of the ECLAT algorithm makes it a faster algorithm than
the Apriori algorithm.
• It uses a bit vector (or tidset) per item. The search is organised as a prefix tree that keeps items in lexicographic order.
• Support is computed by overlaying (intersecting) the transactions of the items, e.g. by ANDing their bit vectors, as sketched below.
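As a small illustration of the bit-vector idea (the two vectors below are made up, one bit per transaction), the support of a 2-itemset is simply the number of set bits in the bitwise AND of its items' vectors:

# Illustrative bit-vector support counting; each bit marks whether a transaction
# (out of 10 made-up transactions) contains the item.
bitvec = {
    "A": 0b1011110101,   # A occurs in 7 of the 10 transactions
    "C": 0b1011100111,   # C occurs in 7 of the 10 transactions
}
# Support({A, C}) = number of transactions containing both items
# = number of set bits in the AND of the two vectors.
print(bin(bitvec["A"] & bitvec["C"]).count("1"))   # -> 6 for these example vectors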
[Figure: ECLAT prefix tree with bit vectors. The example database holds ten transactions ({A,D,E}, {B,C,D}, {A,C,E}, {A,C,D,E}, {A,E}, {A,C,D}, {B,C}, {A,C,D,E}, {C,B,E}, {A,D,E}); the single-item supports are A: 7, B: 3, C: 7, D: 6, E: 7, each with a 10-bit vector marking the transactions that contain it. Child nodes such as AC, AD and AE carry the supports obtained by ANDing their parents' bit vectors.]
• How does the algorithm work?
▪ The basic idea is to use Transaction Id Set (tidset) intersections to compute the support value of a
candidate, avoiding the generation of subsets that do not exist in the prefix tree.
▪ In the first call of the function, all single items are used along with their tidsets.
▪ Then the function is called recursively; in each recursive call, each item-tidset pair is verified
and combined with other item-tidset pairs.
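A minimal recursive ECLAT sketch along these lines is given below. The item names and tidsets are the ones from the tables that follow, while the min_support value of 3 is assumed for illustration; a full implementation would also reorder items by support.

# Minimal ECLAT sketch: depth-first search over item prefixes,
# computing the support of each candidate by intersecting tidsets.
def eclat(prefix, items, min_support, results):
    # `items` is a list of (item, tidset) pairs that all extend `prefix`
    for i, (item, tids) in enumerate(items):
        new_prefix = prefix + (item,)
        results[new_prefix] = len(tids)
        # Combine with the remaining items; the intersection gives the new tidset
        suffix = [(other, tids & other_tids)
                  for other, other_tids in items[i + 1:]
                  if len(tids & other_tids) >= min_support]
        if suffix:
            eclat(new_prefix, suffix, min_support, results)

# Vertical database: item -> tidset (from the tables below)
vertical_db = {
    "Bread":  {"T1", "T4", "T5", "T7", "T8", "T9"},
    "Butter": {"T1", "T2", "T3", "T4", "T6", "T8", "T9"},
    "Milk":   {"T3", "T5", "T6", "T7", "T8", "T9"},
}
results = {}
eclat((), sorted(vertical_db.items()), min_support=3, results=results)
for itemset, sup in results.items():
    print(itemset, sup)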
The vertical (item → tidset) representation of the database:

Item      Tidset
Bread     {T1, T4, T5, T7, T8, T9}
Butter    {T1, T2, T3, T4, T6, T8, T9}
Milk      {T3, T5, T6, T7, T8, T9}

2-itemsets formed by combining the items (their tidsets are obtained by intersection):

Itemset
{Bread, Butter}
{Bread, Milk}
{Bread, Jam}
{Butter, Milk}
{Butter, Jam}
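The tidsets of these 2-itemsets are obtained by intersecting the 1-itemset tidsets above; Jam's tidset is not listed there, so only the pairs of Bread, Butter and Milk are checked in this quick sketch:

# Single-item tidsets copied from the table above
tidsets = {
    "Bread":  {"T1", "T4", "T5", "T7", "T8", "T9"},
    "Butter": {"T1", "T2", "T3", "T4", "T6", "T8", "T9"},
    "Milk":   {"T3", "T5", "T6", "T7", "T8", "T9"},
}
# Tidset of a 2-itemset = intersection of its items' tidsets; its size is the support
for a, b in [("Bread", "Butter"), ("Bread", "Milk"), ("Butter", "Milk")]:
    common = tidsets[a] & tidsets[b]
    print(f"{{{a}, {b}}}: tidset = {sorted(common)}, support = {len(common)}")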