Data Warehousing and Data Mining
Chapter 3: Association Rule Mining
Description
Principle
Design
Algorithm
Rule evaluation
• What Is Association Rule Mining?
• Association rule mining finds frequent patterns, associations,
correlations, or causal structures among sets of items in transaction
databases.
• It helps us understand customer buying habits by finding associations and
correlations between the different items that customers place in their
"shopping baskets".
• Applications: basket data analysis, cross-marketing, catalog design,
loss-leader analysis, web log analysis, fraud detection
(supervisor => examiner).
• Basic Concepts of Association Rules:
• Given: (1) a database of transactions, and (2) each transaction is a list
of items purchased by a customer in a visit.
• Find: all rules of the form A => B that correlate the presence of one set
of items with another set of items.
• Basic rule measures: support and confidence.
• Note:
• Minimum_Support is a threshold parameter you need to specify
before processing an association model.
• It means that you are interested only in those itemsets and rules
that occur in at least the specified minimum fraction of the
transactions in the dataset.
• Probability (Confidence)
• Probability is a property of an association rule.
• The probability of a rule A=>B is calculated using the support of
item set {A,B} divided by the support of {A}.
• This probability is also called confidence in the data mining research
community. It is defined as follows:
• Probability(A => B) = Probability(B|A) = Support({A,B}) / Support(A)
• Support and confidence for itemsets A and B are given by the formulas:
• Support(A => B) = Probability({A,B}) = (number of transactions
containing both A and B) / (total number of transactions)
• Confidence(A => B) = Probability(B|A) = Support({A,B}) / Support(A)
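• To make these two measures concrete, here is a minimal Python sketch; the helper names and the toy transaction list are illustrative, not from the text:

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence(A => B) = Support({A,B}) / Support(A)."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))

# Illustrative basket data (not from the text).
transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk"}]
print(support({"bread", "milk"}, transactions))       # 2 of 4 baskets -> 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 0.5 / 0.75 -> 0.666...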
• Importance
• Importance is also called the interesting score or the lift in some
literature. Importance can be used to measure item sets and rules.
The importance of an item set is defined using the following
formula:
• Importance({A,B}) = Probability(A, B) / (Probability(A) * Probability(B))
• If importance = 1, A and B are independent items. It means that
the purchase of product A and purchase of product B are two
independent events.
• If importance < 1, A and B are negatively correlated. This means if a
customer buys A, it is unlikely he will also buy B.
• If importance > 1, A and B are positively correlated. This means if a
customer buys A, it is very likely he also buys B.
• For rules, the importance score is often defined on a logarithmic
scale instead, for example Importance(A => B) =
log(Probability(B|A) / Probability(B|not A)), so the reference point
becomes 0 rather than 1:
• An importance of 0 then means that there is no association between A
and B.
• A positive importance score means that the probability of B goes up
when A is true.
• A negative importance score means that the probability of B goes
down when A is true.
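• Continuing the sketch above (reusing its support() helper and toy transactions, both illustrative), the lift-style importance of an itemset can be computed as follows:

def importance(a, b, transactions):
    """Importance({A,B}) = P(A,B) / (P(A) * P(B)); 1 means independence."""
    return (support(a | b, transactions)
            / (support(a, transactions) * support(b, transactions)))

# 0.5 / (0.75 * 0.75) ~= 0.89 < 1: bread and milk are slightly
# negatively correlated in this toy dataset.
print(importance({"bread"}, {"milk"}, transactions))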
• Candidate Generation
• An itemset of size k+1 is a candidate to be frequent only if all of its
subsets of size k are known to be frequent (the Apriori property); a
sketch of this pruning rule follows below.
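• A minimal sketch of this pruning rule, using a hypothetical prune() helper: a (k+1)-candidate survives only if every one of its k-subsets is already frequent.

from itertools import combinations

def prune(candidates, frequent_k):
    """Keep a (k+1)-candidate only if every k-subset is already frequent."""
    frequent_k = set(frequent_k)
    return [c for c in candidates
            if all(frozenset(s) in frequent_k
                   for s in combinations(c, len(c) - 1))]

# Illustrative data: {I1,I2,I4} is pruned because {I1,I4} is not frequent.
l2 = [frozenset(p) for p in [("I1", "I2"), ("I1", "I3"),
                             ("I2", "I3"), ("I2", "I4")]]
c3 = [frozenset(("I1", "I2", "I3")), frozenset(("I1", "I2", "I4"))]
print(prune(c3, l2))  # only {I1,I2,I3} survives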
• Association rule algorithms
• Most algorithms used to identify large itemsets can be classified as
either sequential or parallel.
• 1) Sequential Algorithms:
• i) AIS: The AIS algorithm makes multiple passes over the entire
database. During each pass, it scans all transactions.
• In the first pass, it counts the support of individual items and
determines which of them are large or frequent in the database.
• The large itemsets of the previous pass are extended with items of
the current transaction to generate candidate itemsets.
• After scanning a transaction, the common itemsets between the large
itemsets of the previous pass and the items of this transaction are
determined; these common itemsets are then extended with the remaining
items of the transaction (a minimal sketch of this step follows below).
• The AIS algorithm was the first published algorithm developed to
generate all large itemsets in a transaction database.
• Advantage:
• The algorithm was used to find whether there was an association between
departments in customers' purchasing behavior.
• Disadvantage:
• A drawback of the AIS algorithm is that the data structures required
for maintaining large and candidate itemsets were not specified.
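• The candidate-extension step described above can be sketched as follows (an assumed simplification; the published AIS algorithm also maintains counting data structures that are not shown here):

def ais_extend(transaction, large_prev):
    """Extend each previously large itemset that occurs in this transaction
    with one more item from the same transaction, forming (k+1)-candidates."""
    items = frozenset(transaction)
    candidates = set()
    for itemset in large_prev:      # large_prev: frozensets from the last pass
        if itemset <= items:        # itemset occurs in this transaction
            for extra in items - itemset:
                candidates.add(itemset | {extra})
    return candidates

l1 = [frozenset({"I1"}), frozenset({"I2"})]
print(ais_extend({"I1", "I2", "I5"}, l1))  # {I1,I2}, {I1,I5}, {I2,I5}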
• ii) SETM:
• Similar to the AIS algorithm, the SETM algorithm makes multiple
passes over the database.
• In the first pass, it counts the support of individual items and
determines which of them are large or frequent in the database. It then
generates the candidate itemsets.
• Unlike AIS, SETM remembers the TIDs (transaction identifiers) of the
generating transactions along with the candidate itemsets (see the
sketch below).
• The relational merge-join operation can be used to generate
candidate itemsets.
• Advantage:
• While generating candidate itemsets, the SETM algorithm saves a copy of
each candidate itemset together with the TID of the generating
transaction, in sequential order.
• Disadvantage:
• Since each candidate itemset has a TID associated with it, SETM
requires more space to store a large number of TIDs.
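• SETM's distinguishing data structure can be sketched by pairing each generated candidate with the TID of its generating transaction (an illustrative simplification, reusing ais_extend from the AIS sketch above); the growing list of (TID, itemset) pairs is exactly the space cost noted above:

def setm_extend(db, large_prev):
    """For each transaction (tid, items), store every generated candidate
    together with the TID of the generating transaction."""
    pairs = []
    for tid, transaction in db:
        for cand in ais_extend(transaction, large_prev):
            pairs.append((tid, cand))  # one (TID, itemset) entry per occurrence
    return pairs

db = [("T1", {"I1", "I2", "I5"}), ("T2", {"I2", "I4"})]
print(setm_extend(db, l1))  # e.g. ("T1", {I1,I2}), ("T1", {I1,I5}), ...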
• Apriori Algorithm: consider the following transaction database D, with
minimum support count = 2.
TID  ITEMS
T1   I1, I2, I5
T2   I2, I4
T3   I2, I3
T4   I1, I2, I4
T5   I1, I3
T6   I2, I3
T7   I1, I3
T8   I1, I2, I3, I5
T9   I1, I2, I3
• STEP 1: Scan D for the count of each candidate 1-itemset, giving C1.
• STEP 2: Compare each candidate's support count with the minimum
support count (2) to obtain L1.

C1 = L1 (every candidate meets the minimum support count):
ITEMS  SUPPORT COUNT
{I1}   6
{I2}   7
{I3}   6
{I4}   2
{I5}   2

• STEP 3: Generate the candidate 2-itemsets C2 by joining L1 with L1,
and scan D for the count of each candidate.

C2:
ITEMS    SUPPORT COUNT
{I1,I2}  4
{I1,I3}  4
{I1,I4}  1
{I1,I5}  2
{I2,I3}  4
{I2,I4}  2
{I2,I5}  2
{I3,I4}  0
{I3,I5}  1
{I4,I5}  0

• STEP 4: Compare the candidate support counts with the minimum support
count to obtain L2.

L2:
ITEMS    SUPPORT COUNT
{I1,I2}  4
{I1,I3}  4
{I1,I5}  2
{I2,I3}  4
{I2,I4}  2
{I2,I5}  2

• STEP 5: Generate the candidate 3-itemsets C3 and scan D for the count
of each candidate.

C3:
ITEMS       SUPPORT COUNT
{I1,I2,I3}  2
{I1,I2,I4}  1
{I1,I2,I5}  2
{I1,I3,I5}  1
{I2,I3,I4}  0
{I2,I3,I5}  1
{I2,I4,I5}  0

• STEP 6: Compare the candidate support counts with the minimum support
count to obtain L3.

L3:
ITEMS       SUPPORT COUNT
{I1,I2,I3}  2
{I1,I2,I5}  2

• STEP 7: Generate the candidate 4-itemsets C4 from L3 and scan D: the
only candidate is {I1,I2,I3,I5}, with support count 1.
• STEP 8: Compare the candidate support count with the minimum support
count to obtain L4. Since 1 < 2, L4 is empty and the algorithm
terminates; the frequent itemsets of D are those in L1, L2, and L3.
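• The walkthrough above can be reproduced end to end with a compact, self-contained Apriori sketch in Python (a minimal, unoptimized implementation for this example):

from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count 1-itemsets to build L1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)
    k = 1
    while frequent:
        # Join step: combine frequent k-itemsets into (k+1)-candidates,
        # keeping only those whose k-subsets are all frequent (Apriori property).
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))}
        # Scan D for candidate counts, then keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result

# Transaction database D from the table above; min support count = 2.
D = [{"I1","I2","I5"}, {"I2","I4"}, {"I2","I3"}, {"I1","I2","I4"},
     {"I1","I3"}, {"I2","I3"}, {"I1","I3"}, {"I1","I2","I3","I5"},
     {"I1","I2","I3"}]
for itemset, count in sorted(apriori(D, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
# The maximal frequent itemsets printed are {I1,I2,I3} and {I1,I2,I5},
# each with support count 2, matching L3 above.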