Module 2
Association Rules
Frequent Patterns: patterns that appear frequently in a dataset.
Association rule mining is a technique used to identify patterns in large data sets. It
involves finding relationships between variables in the data and using those
relationships to make predictions or decisions. The goal of association rule mining is
to uncover rules that describe the relationships between different items in the data
set.
Association rules are if/then statements that help uncover relationships between
seemingly unrelated data in a relational database or another information repository.
An example of an association rule is: “If a customer buys a dozen eggs, he is 80% likely to also purchase milk” (Market Basket Analysis).
The “if” part of the rule is called the antecedent, and the “then” part is called the consequent.
A relationship that associates exactly two items like this is said to have single cardinality; as the number of items in a rule grows, the cardinality grows accordingly. Because thousands of items give rise to an enormous number of candidate rules, several metrics are used to measure the strength of an association.
The metrics are:
Support
Support is the frequency of an itemset, i.e., how often it appears in the dataset. It is defined as the fraction of the transactions T that contain the itemset X:

Support(X) = (number of transactions containing X) / (total number of transactions T)
Confidence
Confidence indicates how often the rule has been found to be true: how often items X and Y occur together in the dataset, given that X already occurs. It is the ratio of the number of transactions containing both X and Y to the number of transactions containing X. Equivalently, it measures the likelihood that the consequent is purchased when the antecedent is purchased, i.e., the proportion of transactions containing the antecedent in which the consequent also appears:

Confidence(X => Y) = support(X, Y) / support(X)
Lift
Lift measures the strength of a rule. It is the ratio of the observed support to the support that would be expected if X and Y were independent of each other:

Lift(X => Y) = support(X, Y) / (support(X) × support(Y))

It has three possible ranges of values:
If Lift = 1: the occurrence of the antecedent and the consequent are independent of each other.

Lift > 1: the two itemsets are positively dependent on each other; the larger the lift, the stronger the association.

Lift < 1: one item is a substitute for the other, meaning one item has a negative effect on the occurrence of the other.
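Since all three metrics reduce to simple ratios over the transaction list, they are easy to compute directly. The sketch below is a minimal illustration on a small hypothetical set of transactions (the item names and data are invented purely for this example):

```python
# Minimal sketch: support, confidence, and lift over a hypothetical
# list of transactions (data invented purely for illustration).
transactions = [
    {"eggs", "milk", "bread"},
    {"eggs", "milk"},
    {"bread", "butter"},
    {"eggs", "bread"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """support(X and Y) / support(X): how often Y appears given X."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    """Observed joint support over the support expected under independence."""
    return support(set(antecedent) | set(consequent)) / (
        support(antecedent) * support(consequent))

print(support({"eggs"}))               # 3/5 = 0.6
print(confidence({"eggs"}, {"milk"}))  # (2/5) / (3/5) ≈ 0.67
print(lift({"eggs"}, {"milk"}))        # (2/5) / (0.6 * 0.6) ≈ 1.11
```

Here lift ≈ 1.11 > 1, so eggs and milk would be read as weakly positively dependent.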
Market basket analysis is a technique used in data mining and retail analytics to
identify relationships and patterns in customer purchasing behavior. It involves
analyzing transactional data, typically from point-of-sale systems, to uncover
associations between products that are frequently purchased together. The goal
is to understand the co-occurrence of items in a customer's shopping basket and
to provide insights that can be used for various purposes, such as product
recommendations, store layout optimization, and targeted marketing strategies.
Terms in analysis
Evaluation and Selection: The generated rules are evaluated based
on measures such as support, confidence, lift, and other metrics. The
selection of rules is based on the desired quality and significance
criteria.
Frequent Item-sets
Frequent itemsets refer to sets of items that frequently occur together in
transactions above a specified minimum support threshold. The support of
an itemset is the proportion of transactions that contain all the items in the
set. By identifying frequent itemsets, retailers can uncover patterns and
associations among items that are commonly purchased together.
Frequent itemsets are typically discovered using algorithms like the Apriori
algorithm or the FP-Growth algorithm, which efficiently traverses the
transactional dataset to find itemsets that meet the minimum support
criteria.
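As a concrete, deliberately naive illustration of this definition, the sketch below enumerates every possible itemset and keeps those whose support meets the threshold. The data and threshold are hypothetical, and algorithms like Apriori and FP-Growth exist precisely to avoid this exhaustive search:

```python
from itertools import combinations

# Brute-force frequent-itemset search (illustrative only; hypothetical data).
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]
min_support = 0.4  # minimum fraction of transactions

items = sorted(set().union(*transactions))
frequent = {}
for k in range(1, len(items) + 1):
    for candidate in combinations(items, k):
        # Support: fraction of transactions containing the whole candidate
        sup = sum(set(candidate) <= t for t in transactions) / len(transactions)
        if sup >= min_support:
            frequent[candidate] = sup

for itemset, sup in sorted(frequent.items()):
    print(itemset, sup)  # e.g. ('A', 'B', 'C') 0.4
```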
Closed Item-sets
Closed itemsets are a specific type of frequent itemset that has no superset with the same support. In other words, a closed itemset is an itemset for which no superset (an itemset containing all of its items plus at least one more) has equal support. Closed itemsets capture the essential associations without redundancy.
For example, if the itemset {A, B, C} has a support of 0.1 and no superset of {A, B, C} also has a support of 0.1, then {A, B, C} is a closed itemset.
Closed itemsets are useful because they provide a more concise
representation of frequent itemsets and simplify the interpretation of
association rules.
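A small sketch of this check, on hypothetical data: an itemset is closed exactly when no proper superset of it has the same support.

```python
from itertools import combinations

# Closed-itemset check on hypothetical data: closed <=> no proper
# superset has the same support.
transactions = [
    {"A", "B", "C"},
    {"A", "B", "C"},
    {"A", "B"},
    {"C"},
]

def support(itemset):
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

items = sorted(set().union(*transactions))
all_itemsets = [frozenset(c) for k in range(1, len(items) + 1)
                for c in combinations(items, k)]

def is_closed(itemset):
    s = support(itemset)
    # Closed if every proper superset has strictly lower support
    return all(support(bigger) < s for bigger in all_itemsets if itemset < bigger)

for itemset in all_itemsets:
    print(sorted(itemset), support(itemset),
          "closed" if is_closed(itemset) else "not closed")
```

In this data, {A} is not closed because its superset {A, B} has the same support (0.75), whereas {A, B} itself is closed.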
Apriori Algorithm
Finding Frequent Itemsets by Confined Candidate Generation
For this example, the minimum support count is 2 and the minimum confidence is 60%.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).
(II) Compare each candidate item’s support count with the minimum support count (here min_support = 2); if a candidate’s support count is less than min_support, remove it. This gives us itemset L1.
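The transaction table for this walkthrough appears only as a figure in the original notes, so the nine-transaction dataset below is an assumption, chosen to be consistent with the support counts used in the confidence calculations at the end (e.g. sup(I1^I2) = 4). Given that assumption, Step-1 looks like:

```python
from collections import Counter

# ASSUMED dataset (the original table was a figure); picked to match the
# support counts quoted later, e.g. sup(I1^I2) = 4 and sup(I1^I2^I3) = 2.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_support = 2  # minimum support *count*

# (I) C1: support count of every individual item
c1 = Counter(item for t in transactions for item in t)
# (II) L1: keep items whose count meets min_support
l1 = {frozenset([item]): n for item, n in c1.items() if n >= min_support}
print(l1)  # every item survives here: I1:6, I2:7, I3:6, I4:2, I5:2
```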
Step-2: K=2
(I) Generate candidate set C2 by joining L1 with itself, then find the support count of each candidate pair in the dataset.
(II) Compare each candidate’s (C2) support count with the minimum support count (here min_support = 2); if a candidate’s support count is less than min_support, remove it. This gives us itemset L2.
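Continuing the sketch above (it reuses `transactions`, `l1`, and `min_support`), the join step pairs the frequent items into C2, and pruning by min_support yields L2:

```python
from itertools import combinations

# Join step: C2 = all pairs of frequent 1-items; then count and prune to L2.
items_l1 = sorted(item for s in l1 for item in s)
l2 = {}
for pair in combinations(items_l1, 2):
    candidate = frozenset(pair)
    count = sum(candidate <= t for t in transactions)
    if count >= min_support:
        l2[candidate] = count
print(l2)  # with the assumed data: {I1,I2}:4, {I1,I3}:4, {I2,I3}:4, ...
```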
Step-3: K=3
(I) Generate candidate set C3 by joining L2 with itself; the condition for joining Lk-1 with Lk-1 is that the itemsets share (K-2) items in common. Check that every subset of each candidate is frequent (the Apriori property), removing candidates that are not, and then find the support count of the survivors in the dataset.
(II) Compare each candidate’s (C3) support count with the minimum support count (here min_support = 2); if a candidate’s support count is less than min_support, remove it. This gives us itemset L3.
Step-4: K=4
(I) Generate candidate set C4 by joining L3 with itself; again, Lk-1 and Lk-1 are joined only when they share (K-2) items in common. Every candidate here contains an infrequent subset, so C4 comes out empty and the algorithm stops: the frequent itemsets in L3 are the largest found.
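Putting the join, prune, count, and compare steps together, here is a minimal self-contained Apriori sketch (a pedagogical version, not an optimized implementation):

```python
from collections import defaultdict
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their support counts."""
    # L1: count single items and keep those meeting min_support
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    lk = {s: n for s, n in counts.items() if n >= min_support}
    all_frequent = dict(lk)
    k = 2
    while lk:
        prev = list(lk)
        # Join step: union (k-1)-itemsets that overlap in (k-2) items
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in lk for s in combinations(c, k - 1))}
        # Count step: scan the dataset and keep candidates meeting min_support
        lk = {}
        for c in candidates:
            n = sum(c <= t for t in transactions)
            if n >= min_support:
                lk[c] = n
        all_frequent.update(lk)
        k += 1
    return all_frequent

# With the assumed dataset above, apriori(transactions, 2) yields L1, L2,
# and L3 = {I1,I2,I3}: 2 and {I1,I2,I5}: 2; no frequent 4-itemsets exist.
```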
To find the strong association rules, we need to calculate the confidence of each rule:

Confidence(A => B) = support_count(A ∪ B) / support_count(A)
So the rules can be:
[I1^I2]=>[I3] // confidence = sup(I1^I2^I3)/sup(I1^I2) = (2/4)×100 = 50%
[I1^I3]=>[I2] // confidence = sup(I1^I2^I3)/sup(I1^I3) = (2/4)×100 = 50%
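These two confidences can be verified directly (using the assumed `transactions` list from the Step-1 sketch):

```python
# Confidence = sup(antecedent ∪ consequent) / sup(antecedent), as a percentage
sup_i1_i2_i3 = sum({"I1", "I2", "I3"} <= t for t in transactions)  # 2
sup_i1_i2    = sum({"I1", "I2"} <= t for t in transactions)        # 4
sup_i1_i3    = sum({"I1", "I3"} <= t for t in transactions)        # 4
print(sup_i1_i2_i3 / sup_i1_i2 * 100)  # 50.0  -> [I1^I2] => [I3]
print(sup_i1_i2_i3 / sup_i1_i3 * 100)  # 50.0  -> [I1^I3] => [I2]
```

Both rules come in at 50%, below the 60% minimum confidence, so neither would be accepted as a strong rule.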