Module 2
Association Rules
Frequent Patterns: Patterns that appear frequently in a dataset.

Association rule mining is a technique used to identify patterns in large data sets. It
involves finding relationships between variables in the data and using those
relationships to make predictions or decisions. The goal of association rule mining is
to uncover rules that describe the relationships between different items in the data
set.
Association rules are if/then statements that help uncover relationships between
seemingly unrelated data in a relational database or another information repository.
An example of an association rule is "If a customer buys a dozen eggs, they are 80% likely to also purchase milk" (market basket analysis).

How does Association Rule Learning work?


Association rule learning works on the concept of if/then statements, such as "if A, then B."

The "if" part is called the antecedent, and the "then" part is called the consequent. A relationship involving a single antecedent item and a single consequent item is said to have single cardinality; as the number of items in a rule increases, the cardinality increases accordingly. To measure the strength of associations across thousands of data items, several metrics are used.

The metrics are:

Support
Support is the frequency of an item or itemset, i.e., how often it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X:

Support(X) = (Number of transactions containing X) / (Total number of transactions)

Confidence
Confidence indicates how often the rule has been found to be true: how often items X and Y occur together in the dataset, given that X occurs. It is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X. In other words, it measures the likelihood that the consequent is purchased when the antecedent is purchased:

Confidence(X ⇒ Y) = Support(X ∪ Y) / Support(X)

Lift
Lift measures the strength of a rule. It is the ratio of the observed support to the support expected if X and Y were independent of each other:

Lift(X ⇒ Y) = Support(X ∪ Y) / (Support(X) × Support(Y))

It has three possible cases:
If Lift = 1: the occurrence of the antecedent and the consequent are independent of each other.

If Lift > 1: the two itemsets are positively dependent on each other; the antecedent makes the consequent more likely.

If Lift < 1: one item is a substitute for the other, meaning one item has a negative effect on the occurrence of the other.
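The three metrics can be computed directly from transaction data. The following is a minimal sketch; the transactions and item names are illustrative, not taken from this module:

```python
# Minimal sketch of support, confidence, and lift computed from raw transactions.
# The transactions below are illustrative examples, not data from these notes.

transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"milk", "bread", "eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support of (antecedent ∪ consequent) divided by support of antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Observed support over the support expected under independence."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"milk", "bread"}))        # 0.6
print(confidence({"milk"}, {"bread"}))   # 0.75
print(lift({"milk"}, {"bread"}))         # 0.9375 (slightly below 1)
```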

Market Basket Analysis

Market basket analysis is a technique used in data mining and retail analytics to
identify relationships and patterns in customer purchasing behavior. It involves
analyzing transactional data, typically from point-of-sale systems, to uncover
associations between products that are frequently purchased together. The goal
is to understand the co-occurrence of items in a customer's shopping basket and
to provide insights that can be used for various purposes, such as product
recommendations, store layout optimization, and targeted marketing strategies.

Techniques Used in Market Basket Analysis

Terms in analysis

Itemset (I): A collection of items represented as a set {I1, I2, ..., Im}.

Database transactions (D): A set of transactions, where each transaction T is a nonempty itemset and T is a subset of the itemset I. Each transaction is associated with a unique identifier called a TID.

Association rule: An association rule is an implication of the form A ⇒ B, where A and B are itemsets. A and B are subsets of I, both are non-empty, and they share no common items (A ∩ B = ∅). Association rules describe relationships between items in transactions.

Support (s): The support of an association rule A ⇒ B is the percentage of transactions in the database D that contain the union of sets A and B (A ∪ B). It is also considered the probability P(A ∪ B).

Confidence (c): The confidence of an association rule A ⇒ B is the percentage of transactions in D containing A that also contain B. It is the conditional probability P(B|A).

Strong rules: Association rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are considered strong. These thresholds are set based on the desired significance level.

Occurrence frequency: The occurrence frequency of an itemset is the count of transactions in D that contain the itemset. It is also referred to as the support count of the itemset.

Frequent itemsets: Itemsets whose relative support (the proportion of transactions containing them) satisfies a minimum support threshold are considered frequent. The set of frequent k-itemsets is denoted by Lk.

Pruning: Pruning techniques can be applied to remove uninteresting or redundant rules, enhancing the quality and interpretability of the generated rules.

Evaluation and Selection: The generated rules are evaluated based
on measures such as support, confidence, lift, and other metrics. The
selection of rules is based on the desired quality and significance
criteria.

Interpretation and Application: The discovered association rules provide insight into patterns and relationships among items in the transactional data. They can be utilized for various purposes, such as product recommendations, cross-selling, pricing strategies, and targeted marketing campaigns.

Frequent Item-sets
Frequent itemsets are sets of items that occur together in transactions at or above a specified minimum support threshold. The support of an itemset is the proportion of transactions that contain all the items in the set. By identifying frequent itemsets, retailers can uncover patterns and associations among items that are commonly purchased together.

For example, if the minimum support threshold is set to 5%, an itemset containing "bread" and "milk" that appears in 7% of all transactions would be considered a frequent itemset.

Frequent itemsets are typically discovered using algorithms such as Apriori or FP-Growth, which efficiently traverse the transactional dataset to find itemsets that meet the minimum support criterion.
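To make the definition concrete, here is a brute-force sketch that enumerates every candidate itemset and keeps those meeting the threshold (the transactions and threshold are illustrative). The Apriori algorithm described below avoids this exhaustive enumeration:

```python
from itertools import combinations

# Brute-force sketch: test every possible itemset against min_support.
# Illustrative only; Apriori avoids examining all 2^m candidate itemsets.

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
]
min_support = 0.5  # itemset must appear in at least half of all transactions

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        count = sum(set(candidate) <= t for t in transactions)
        if count / len(transactions) >= min_support:
            frequent[candidate] = count

print(frequent)  # e.g. ('bread',): 3, ('milk',): 3, ('bread', 'milk'): 2
```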

Closed Item-sets
Closed itemsets are frequent itemsets that have no proper superset with the same support. In other words, a closed itemset is one for which no larger itemset containing the same items occurs in just as many transactions. Closed itemsets capture the essential associations without redundancy.

For example, if the itemset {A, B, C} has a support of 0.1, and no superset of {A, B, C} has a support of 0.1, then {A, B, C} is a closed itemset.

Closed itemsets are useful because they provide a more concise representation of frequent itemsets and simplify the interpretation of association rules.
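A minimal sketch of this check, assuming a dictionary of frequent itemsets with illustrative support counts:

```python
# Minimal sketch of identifying closed itemsets among frequent itemsets.
# `frequent` maps an itemset (frozenset) to its support count; values are illustrative.

frequent = {
    frozenset({"A"}): 4,
    frozenset({"B"}): 3,
    frozenset({"A", "B"}): 3,
    frozenset({"A", "B", "C"}): 2,
}

def is_closed(itemset, frequent):
    """An itemset is closed if no proper superset has the same support."""
    return not any(
        itemset < other and frequent[other] == frequent[itemset]
        for other in frequent
    )

for itemset in frequent:
    print(sorted(itemset), "closed" if is_closed(itemset, frequent) else "not closed")
# {B} is not closed: its superset {A, B} has the same support count (3).
```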

Association rule mining

Association rule mining involves two main steps. First, all frequent itemsets are identified by applying a minimum support threshold. Then, strong association rules are generated from the frequent itemsets by applying the minimum confidence threshold.

Apriori Algorithm
Finding Frequent Itemsets by Confined Candidate Generation

Consider the following dataset; we will find the frequent itemsets and generate association rules for them.

Minimum support count is 2 and minimum confidence is 60%.
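The transaction table itself appears as an image in the original notes. The support counts used in the steps below (e.g., sup(I1) = 6, sup(I2) = 7, sup{I1, I2} = 4) are consistent with the classic nine-transaction example, reproduced here under that assumption:

TID    Items
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3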

Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).

(II) Compare each candidate item's support count with the minimum support count (here min_support = 2; if the support count of a candidate item is less than min_support, remove that item). This gives us the itemset L1.

Step-2: K=2

Generate candidate set C2 using L1 (this is called the join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common.

Check whether all subsets of each itemset are frequent, and if not, remove that itemset. (For example, the subsets of {I1, I2} are {I1} and {I2}, which are frequent. Check this for each itemset.)

Now find the support count of these itemsets by searching the dataset.

(II) Compare the candidate set (C2) support counts with the minimum support count (here min_support = 2; if the support count of a candidate itemset is less than min_support, remove it). This gives us the itemset L2.

Step-3:

Generate candidate set C3 using L2 (join step). The condition for joining Lk-1 with Lk-1 is that the itemsets should have (K-2) elements in common, so here, for L2, the first element should match. The itemsets generated by joining L2 are {I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I4, I5}, and {I2, I3, I5}.

Check whether all subsets of these itemsets are frequent, and if not, remove that itemset. (Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly, check every itemset.)

Find the support count of the remaining itemsets by searching the dataset.

(II) Compare the candidate set (C3) support counts with the minimum support count (here min_support = 2; if the support count of a candidate itemset is less than min_support, remove it). This gives us the itemset L3.

Step-4:

Generate candidate set C4 using L3 (join step). The condition for joining Lk-1 with Lk-1 (K=4) is that the itemsets should have (K-2) elements in common, so here, for L3, the first two elements (items) should match.

Check whether all subsets of these itemsets are frequent. (Here the itemset formed by joining L3 is {I1, I2, I3, I5}, and its subsets include {I1, I3, I5}, which is not frequent.) So there is no itemset in C4.

We stop here because no further frequent itemsets are found.
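The level-wise procedure just walked through can be sketched compactly in Python. This is a minimal illustration, assuming the nine-transaction table reproduced above:

```python
from itertools import combinations

# Sketch of the level-wise Apriori procedure described above, run on the
# nine-transaction table assumed earlier (an image in the original notes).

transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_support = 2  # minimum support count

def support_count(itemset):
    return sum(itemset <= t for t in transactions)

def join(prev_frequent, k):
    """Join step: merge (k-1)-itemsets into k-candidates, then prune any
    candidate that has an infrequent (k-1)-subset (Apriori property)."""
    candidates = set()
    for a in prev_frequent:
        for b in prev_frequent:
            union = a | b
            if len(union) == k and all(
                frozenset(s) in prev_frequent for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates

# L1: frequent 1-itemsets
level = {frozenset({i}) for t in transactions for i in t}
level = {c for c in level if support_count(c) >= min_support}
k = 2
while level:
    print(f"L{k - 1}:", [sorted(c) for c in sorted(level, key=sorted)])
    level = {c for c in join(level, k) if support_count(c) >= min_support}
    k += 1
# Prints L1 (all five items), L2 (six pairs), and
# L3 ({I1, I2, I3} and {I1, I2, I5}), then stops: C4 is empty.
```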

Generating Association Rules from Frequent Item-Sets

Thus, we have discovered all the frequent itemsets. Now the generation of strong association rules comes into the picture. For that, we need to calculate the confidence of each rule.

Confidence:
A confidence of 60% means that 60% of the customers who purchased milk and bread also bought butter.

Confidence(A => B) = Support_count(A ∪ B) / Support_count(A)

So here, taking one frequent itemset as an example, we will show the rule generation.

Itemset {I1, I2, I3} // from L3

So the rules can be:

[I1 ^ I2] => [I3] // confidence = sup(I1 ^ I2 ^ I3) / sup(I1 ^ I2) = 2/4 × 100 = 50%
[I1 ^ I3] => [I2] // confidence = sup(I1 ^ I2 ^ I3) / sup(I1 ^ I3) = 2/4 × 100 = 50%
[I2 ^ I3] => [I1] // confidence = sup(I1 ^ I2 ^ I3) / sup(I2 ^ I3) = 2/4 × 100 = 50%
[I1] => [I2 ^ I3] // confidence = sup(I1 ^ I2 ^ I3) / sup(I1) = 2/6 × 100 = 33%
[I2] => [I1 ^ I3] // confidence = sup(I1 ^ I2 ^ I3) / sup(I2) = 2/7 × 100 = 28%
[I3] => [I1 ^ I2] // confidence = sup(I1 ^ I2 ^ I3) / sup(I3) = 2/6 × 100 = 33%

So if the minimum confidence is 50%, the first three rules can be considered strong association rules.
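The same rule generation can be sketched in code. The support counts below are the ones derived in the walkthrough above:

```python
from itertools import combinations

# Sketch of rule generation from one frequent itemset, using the support
# counts derived above: sup{I1,I2,I3}=2, sup{I1,I2}=4, sup{I1,I3}=4,
# sup{I2,I3}=4, sup{I1}=6, sup{I2}=7, sup{I3}=6.

support_count = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I3"}): 6,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I3"}): 4,
    frozenset({"I2", "I3"}): 4, frozenset({"I1", "I2", "I3"}): 2,
}

itemset = frozenset({"I1", "I2", "I3"})
min_confidence = 0.5

# Every non-empty proper subset of the itemset can serve as an antecedent.
for r in range(1, len(itemset)):
    for antecedent in map(frozenset, combinations(sorted(itemset), r)):
        consequent = itemset - antecedent
        conf = support_count[itemset] / support_count[antecedent]
        strong = "strong" if conf >= min_confidence else "weak"
        print(f"{sorted(antecedent)} => {sorted(consequent)}: "
              f"{conf:.0%} ({strong})")
# The three rules with two-item antecedents reach 50% and come out strong,
# matching the walkthrough above.
```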

Improving the Efficiency of Apriori

To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used, which reduces the search space.

Apriori property:
All non-empty subsets of a frequent itemset must be frequent. The key concept of the Apriori algorithm is the anti-monotonicity of the support measure: if an itemset is infrequent, all of its supersets will be infrequent.
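This property justifies the prune step used in candidate generation above; a minimal sketch of the subset check:

```python
from itertools import combinations

def has_infrequent_subset(candidate, prev_frequent):
    """Prune step: a k-candidate is viable only if every (k-1)-subset
    of it was found frequent at the previous level (Apriori property)."""
    k = len(candidate)
    return any(
        frozenset(s) not in prev_frequent
        for s in combinations(candidate, k - 1)
    )

# Example from the walkthrough: {I3, I4} was infrequent, so any candidate
# containing it, such as {I2, I3, I4}, is pruned without scanning the data.
L2 = {frozenset(p) for p in [("I1", "I2"), ("I1", "I3"), ("I1", "I5"),
                             ("I2", "I3"), ("I2", "I4"), ("I2", "I5")]}
print(has_infrequent_subset(frozenset({"I2", "I3", "I4"}), L2))  # True -> prune
```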
