Association Rule Mapping - Unit-4

IIIRD CSE SEM-I DWDM NOTES

UNIT-IV

Association Analysis
Association mining aims to extract interesting correlations, frequent patterns, associations, or
causal structures among sets of items or objects in transaction databases, relational databases,
or other data repositories. Association rules are widely used in various areas such as
telecommunication networks, market and risk management, inventory control, cross-
marketing, catalog design, loss-leader analysis, clustering, classification, etc.
Examples:
Rule Form: Body->Head [Support, confidence]
Buys (X, “Computer”) ->Buys (X, “Software”) [40%, 50%]
Association rule: basic concepts:
Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a
customer in a visit)
Find: all rules that correlate the presence of one set of items with that of another set of items.
 E.g., 98% of people who purchase tires and auto accessories also get automotive services done.
 E.g., Market Basket Analysis
 This process analyzes customer buying habits by finding associations between the different
items that customers place in their "shopping baskets". The discovery of such associations
can help retailers develop marketing strategies by gaining insight into which items are
frequently purchased together by customers.

Applications:
Maintenance agreement (what the store should do to boost maintenance agreement sales)
Home Electronics (what other products should the store stock up on?)
Attached mailing in direct marketing

Association Rule:

An association rule is an implication expression of the form X -> Y, where X and Y are
disjoint itemsets, i.e., X ∩ Y = ∅. The strength of an association rule can be measured in
terms of its support and confidence. Support determines how often a rule is applicable to a
given data set, while confidence determines how frequently items in Y appear in
transactions that contain X. The formal definitions of these metrics are

Support, s(X -> Y) = σ(X ∪ Y) / N

Confidence, c(X -> Y) = σ(X ∪ Y) / σ(X)

where σ(·) denotes the support count (the number of transactions containing an itemset) and
N is the total number of transactions.
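
To make the two metrics concrete, the following is a minimal sketch, with made-up transactions and illustrative item names (not taken from the notes), of how support and confidence could be computed for a rule X -> Y:

# Illustrative transactions (toy data).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def sigma(itemset):
    """Support count: number of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(X, Y):
    """s(X -> Y) = sigma(X union Y) / N"""
    return sigma(X | Y) / len(transactions)

def confidence(X, Y):
    """c(X -> Y) = sigma(X union Y) / sigma(X)"""
    return sigma(X | Y) / sigma(X)

X, Y = {"milk", "diapers"}, {"beer"}
print(support(X, Y))     # 2/5 = 0.4
print(confidence(X, Y))  # 2/3, about 0.67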

Why Use Support and Confidence? Support is an important measure because a rule that has
very low support may occur simply by chance. A low support rule is also likely to be
uninteresting from a business perspective because it may not be profitable to promote items
that customers seldom buy together. For these reasons, support is often used to eliminate
uninteresting rules.
Confidence, on the other hand, measures the reliability of the inference made by a rule. For a
given rule X -> Y, the higher the confidence, the more likely it is for Y to be present in
transactions that contain X. Confidence also provides an estimate of the conditional
probability of Y given X.
Therefore, a common strategy adopted by many association rule mining algorithms is to
decompose the problem into two major subtasks:
1. Frequent Itemset Generation, whose objective is to find all the item- sets that satisfy the
minsupthreshold. These itemsets are called frequent itemsets.
2. Rule Generation, whose objective is to extract all the high-confidence rules from the
frequent itemsets found in the previous step. These rules are called strong rules.

Frequent Itemset Generation:


A lattice structure can be used to enumerate the list of all possible itemsets. The figure above
shows an itemset lattice for I = {a, b, c, d, e}. In general, a data set that contains k items can
potentially generate up to 2^k − 1 frequent itemsets, excluding the null set. Because k can be
very large in many practical applications, the search space of itemsets that needs to be
explored is exponentially large.
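
As a quick illustration of that count, the short sketch below (not part of the original notes) enumerates every non-empty itemset of I = {a, b, c, d, e} and confirms there are 2^5 − 1 = 31 of them:

from itertools import combinations

items = ["a", "b", "c", "d", "e"]

# Every non-empty subset of the item set, grouped by size.
itemsets = [set(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

print(len(itemsets))        # 31
print(2 ** len(items) - 1)  # 31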
To find frequent itemsets we have two algorithms,
a) Apriori Algorithm
b) FP-Growth

Confidence Based Pruning

For each frequent k-itemset, one can produce up to 2^k − 2 candidate association rules. This
quickly becomes computationally expensive; a frequent 10-itemset alone already yields
2^10 − 2 = 1022 candidate rules. Recall the anti-monotone property from the previous section
that was used to prune the frequent itemsets. Unfortunately, confidence does not have this
property in general, but rules generated from the same itemset do satisfy an anti-monotone
property.
The theorem states that if a rule X -> (Y − X) does not satisfy the confidence threshold, then any
rule X' -> (Y − X'), where X' is a subset of X, cannot satisfy the confidence threshold either.
The diagram below shows the pruning of association rules using this theorem.

Rule Generation in Apriori Algorithm

In the Apriori algorithm, a level-wise approach is used to generate association rules. First,
the high-confidence rules that have only one item in the rule consequent are extracted; these
rules are then used to generate new candidate rules by merging their consequents.
For example, in the diagram above, if {1 3 4} -> {2} and {1 2 4} -> {3} are high-confidence
rules, then the candidate rule {1 4} -> {2 3} is generated by merging the consequents of both
rules.
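
A simplified sketch of this level-wise rule generation is given below. It assumes a dictionary support_count that already holds the support counts of the frequent itemset and all of its subsets (a hypothetical pre-computed structure, not something defined in these notes). At each level it keeps only the high-confidence consequents and merges them to build the next level, which is exactly the confidence-based pruning described above.

def gen_rules(freq_itemset, support_count, minconf):
    """Generate high-confidence rules from one frequent itemset.

    freq_itemset  : frozenset of items
    support_count : dict mapping frozenset -> support count (assumed precomputed)
    minconf       : minimum confidence threshold
    """
    rules = []
    # Level 1: consequents with a single item.
    consequents = [frozenset([i]) for i in freq_itemset]
    while consequents and len(consequents[0]) < len(freq_itemset):
        kept = []
        for H in consequents:
            X = freq_itemset - H
            conf = support_count[freq_itemset] / support_count[X]
            if conf >= minconf:
                rules.append((X, H, conf))
                kept.append(H)  # only high-confidence consequents survive
        # Merge surviving consequents to build the next level; rules whose
        # consequent extends a pruned consequent are never generated.
        consequents = list({a | b for a in kept for b in kept
                            if len(a | b) == len(a) + 1})
    return rules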

Compact Representation of Frequent Itemset

Introduction

What happens when you have a large market basket data set with over a hundred items?
The number of frequent itemsets grows exponentially, and this in turn creates a storage issue.
For this reason, alternative representations have been derived that are much smaller than the
full collection of frequent itemsets but can still be used to derive all of the other frequent
itemsets. The maximal and closed frequent itemsets are two such representations, and they
are discussed in this section.

Maximal Frequent Itemset

Definition

It is a frequent itemset for which none of its immediate supersets are frequent.

Identification

1. Examine the frequent itemsets that appear at the border between the infrequent and
frequent itemsets.
2. Identify all of its immediate supersets.
3. If none of the immediate supersets are frequent, the itemset is maximal frequent.

Illustration

For instance, consider the diagram shown below. The lattice is divided into two groups, with
the red dashed line serving as the demarcation: the itemsets above the line are frequent
itemsets, and the blue ones below the red dashed line are infrequent.

 In order to find the maximal frequent itemsets, you first identify the frequent itemsets
at the border, namely d, bc, ad and abc.
 Then identify their immediate supersets. The supersets of d and bc are characterized by
the blue dashed line; if you trace the lattice, you notice that d has three supersets and one
of them, ad, is frequent, so d cannot be maximal frequent. For bc there are two supersets,
namely abc and bcd; abc is frequent, so bc is NOT maximal frequent.
 The supersets of ad and abc are characterized by a solid orange line. The only superset of
abc is abcd, and because it is infrequent, abc is maximal frequent. For ad, there are two
supersets, abd and acd; both of them are infrequent, so ad is also maximal frequent.
Closed Frequent Itemset

Definition:

It is a frequent itemset that is both closed and whose support is greater than or equal to minsup.
An itemset is closed in a data set if there exists no superset that has the same support count as
the original itemset.

Identification

1. First identify all frequent itemsets.


2. Then, from this group, find those that are closed by checking whether there exists a
superset that has the same support as the frequent itemset. If there is, the itemset is
disqualified; if none can be found, the itemset is closed.
An alternative method is to first identify the closed itemsets and then use minsup
to determine which ones are frequent.
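
The sketch below (a toy example with made-up transactions, not the lattice from the figure) applies both definitions in code: an itemset is maximal frequent if no frequent proper superset exists, and closed if no superset has the same support count.

from itertools import combinations

# Toy transactions over items a, b, c (illustrative only).
transactions = [set("abc"), set("abc"), set("ab"), set("bc"), set("a")]
minsup = 2

# Support counts for every frequent itemset (brute force is fine at this size).
items = sorted(set().union(*transactions))
supports = {}
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        s = frozenset(combo)
        count = sum(1 for t in transactions if s <= t)
        if count >= minsup:
            supports[s] = count

frequent = set(supports)

def is_maximal(s):
    # No frequent proper superset exists.
    return not any(s < other for other in frequent)

def is_closed(s):
    # No superset has the same support count.
    return not any(s < other and supports[other] == supports[s] for other in frequent)

maximal = sorted("".join(sorted(s)) for s in frequent if is_maximal(s))
closed = sorted("".join(sorted(s)) for s in frequent if is_closed(s))
print(maximal)  # ['abc']
print(closed)   # ['a', 'ab', 'abc', 'b', 'bc']

Note that the maximal frequent itemset also appears in the closed list, which matches the relationship between the three representations discussed below.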

Illustration

The lattice diagram above shows the maximal, closed and frequent itemsets. The itemsets that
are circled with blue are the frequent itemsets. The itemsets that are circled with the thick
blue are the closed frequent itemsets. The itemsets that are circled with the thick blue and
have the yellow fill are the maximal frequent itemsets. In order to determine which of the
frequent itemsets are closed, all you have to do is check whether they have the same support
as their supersets; if they do, they are not closed.
For example, ad is a frequent itemset but has the same support as abd, so it is NOT a closed
frequent itemset; c, on the other hand, is a closed frequent itemset because all of its supersets,
ac, bc, and cd, have supports that are less than 3.
As you can see there are a total of 9 frequent itemsets, 4 of them are closed frequent itemsets
and out of these 4, 2 of them are maximal frequent itemsets. This brings us to the relationship
between the three representations of frequent itemsets.

Relationship between Frequent Itemset Representations

In conclusion, it is important to point out the relationship between frequent itemsets, closed
frequent itemsets and maximal frequent itemsets. As mentioned earlier, closed and maximal
frequent itemsets are subsets of the frequent itemsets, and the maximal frequent itemsets are
the more compact representation because they are a subset of the closed frequent itemsets.
The diagram to the right shows the relationship between these three types of itemsets. Closed
frequent itemsets are more widely used than maximal frequent itemsets because, when
efficiency matters more than space, they provide the support of the subsets, so no additional
pass over the data is needed to find this information.

Apriori Algorithm

The Apriori algorithm is an algorithm used to derive association rules between objects, that
is, to describe how two or more objects are related to one another. In other words, the apriori
algorithm is an association rule learning method that analyzes patterns such as "people who
bought product A also bought product B".

The primary objective of the apriori algorithm is to create association rules between
different objects. The association rule describes how two or more objects are related to one
another. The Apriori algorithm is also widely used for frequent pattern mining. Generally,
you operate the Apriori algorithm on a database that consists of a huge number of
transactions, for example the purchases made by customers at a store such as Big Bazar. By
finding products that are frequently bought together, it helps the customers buy their products
with ease and increases the sales performance of the Big Bazar. In this tutorial, we will
discuss the apriori algorithm with examples.

Introduction

Let us take an example to understand the concept better. You must have noticed that the
pizza shop seller makes a pizza, soft drink, and breadstick combo, and offers a discount to
customers who buy these combos. Have you ever wondered why he does so? He thinks that
customers who buy pizza also buy soft drinks and breadsticks, so by making combos he
makes it easy for the customers and, at the same time, increases his sales performance.
Similarly, if you go to Big Bazar, you will find biscuits, chips, and chocolate bundled
together. It shows that the shopkeeper makes it comfortable for the customers to buy these
products in the same place.

The above two examples are good examples of association rules in data mining. They help
us to learn the concept of the apriori algorithm.

What is Apriori Algorithm?

The Apriori algorithm is an algorithm used for mining frequent itemsets and the relevant
association rules. Generally, the apriori algorithm operates on a database containing a huge
number of transactions, for example, the items customers buy at a Big Bazar.

Apriori algorithm helps the customers to buy their products with ease and increases the sales
performance of the particular store.

Components of Apriori algorithm

The given three components comprise the apriori algorithm.

1. Support
2. Confidence
3. Lift

Let's take an example to understand this concept.

As discussed above, you need a huge database containing a large number of transactions.
Suppose you have 4,000 customer transactions in a Big Bazar. You have to calculate the
Support, Confidence, and Lift for two products, say Biscuits and Chocolate, because
customers frequently buy these two items together.

Out of the 4,000 transactions, 400 contain Biscuits and 600 contain Chocolate, and 200
transactions contain both Biscuits and Chocolate. Using this data, we will find out the
support, confidence, and lift.

Support

Support refers to the default popularity of a product. You find the support by dividing the
number of transactions containing that product by the total number of transactions. Hence,
for Biscuits we get

Support (Biscuits) = 400/4000

= 10 percent.

Confidence

Confidence refers to how likely it is that customers who bought Biscuits also bought
Chocolate. You find it by dividing the number of transactions containing both products by
the number of transactions containing Biscuits.

Confidence (Biscuits -> Chocolate) = 200/400

= 50 percent.

It means that 50 percent of customers who bought biscuits bought chocolates also.
Lift

Continuing the above example, lift refers to the increase in the ratio of the sale of Chocolate
when you sell Biscuits. Using the figures above, lift is calculated as

Lift = Confidence (Biscuits -> Chocolate) / Support (Biscuits)

= 50/10 = 5

It means that the likelihood of people buying both Biscuits and Chocolate together is five
times higher than that of purchasing Biscuits alone. If the lift value is below one, it indicates
that people are unlikely to buy both items together; the larger the value, the better the
combination.
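
The short sketch below simply re-checks these figures in code, using the formula as written in these notes (confidence of the rule divided by the support of Biscuits); the variable names are illustrative.

# Counts from the Big Bazar example above.
total_transactions = 4000
with_biscuits = 400
with_both = 200

support_biscuits = with_biscuits / total_transactions  # 0.10 -> 10 percent
confidence = with_both / with_biscuits                 # 0.50 -> 50 percent
lift = confidence / support_biscuits                   # 5.0

print(support_biscuits, confidence, lift)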

How does the Apriori Algorithm work in Data Mining?

We will understand this algorithm with the help of an example

Consider a Big Bazar scenario where the product set is P = {Rice, Pulse, Oil, Milk, Apple}.
The database comprises six transactions where 1 represents the presence of the product and 0
represents the absence of the product.

Transaction ID   Rice   Pulse   Oil   Milk   Apple
t1               1      1       1     0      0
t2               0      1       1     1      0
t3               0      0       0     1      1
t4               1      1       0     1      0
t5               1      1       1     0      1
t6               1      1       1     1      1

The Apriori Algorithm makes the given assumptions

 All subsets of a frequent itemset must be frequent.
 All supersets of an infrequent itemset must be infrequent.
 Fix a threshold support level. In our case, we have fixed it at 50 percent.

Step 1

Make a frequency table of all the products that appear in the transactions. Now filter the
frequency table to keep only those products whose support is above the 50 percent threshold.
We get the following frequency table.

Product      Frequency (Number of transactions)
Rice (R)     4
Pulse (P)    5
Oil (O)      4
Milk (M)     4

The above table indicates the products frequently bought by the customers.

Step 2

Create pairs of products such as RP, RO, RM, PO, PM, OM. You will get the given
frequency table.

Itemset Frequency (Number of transactions)


RP 4
RO 3
RM 2
PO 4
PM 3
OM 2

Step 3

Apply the same support threshold of 50 percent and keep only the pairs that meet it; in our
case, that means appearing in at least 3 of the 6 transactions.

Thus, we get RP, RO, PO, and PM.

Step 4

Now, look for a set of three products that the customers buy together. We get the given
combination.

1. RP and RO give RPO


2. PO and PM give POM

Step 5

Calculate the frequency of these two itemsets, and you will get the given frequency table.

Itemset   Frequency (Number of transactions)
RPO       3
POM       2

If you apply the threshold again (at least 3 transactions), you can figure out that the customers'
set of three products is RPO.

We have considered an easy example to discuss the apriori algorithm in data mining. In
reality, you find thousands of such combinations.
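
The following small sketch, with assumed variable names, re-counts the pairs from Step 2 and the triples from Step 5 directly from the six transactions in the table above (R = Rice, P = Pulse, O = Oil, M = Milk, A = Apple), which is a handy way to check the frequency tables:

from itertools import combinations

transactions = [
    {"R", "P", "O"},            # t1
    {"P", "O", "M"},            # t2
    {"M", "A"},                 # t3
    {"R", "P", "M"},            # t4
    {"R", "P", "O", "A"},       # t5
    {"R", "P", "O", "M", "A"},  # t6
]

def count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

kept = ["R", "P", "O", "M"]  # products retained after Step 1
for pair in combinations(kept, 2):
    print("".join(pair), count(set(pair)))
# RP 4, RO 3, RM 2, PO 4, PM 3, OM 2 -> RP, RO, PO and PM survive

for label, triple in [("RPO", {"R", "P", "O"}), ("POM", {"P", "O", "M"})]:
    print(label, count(triple))
# RPO 3, POM 2 -> only RPO meets the 3-transaction (50 percent) threshold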
How to improve the efficiency of the Apriori Algorithm?

Several methods can be used to improve the efficiency of the Apriori algorithm.

Hash-based itemset counting

In hash-based itemset counting, a k-itemset whose corresponding hashing bucket count is
below the support threshold cannot be frequent and can therefore be excluded from the
candidate set.

Transaction Reduction

In transaction reduction, a transaction that does not contain any frequent k-itemset cannot
contain any frequent (k+1)-itemset either, so it is not useful in subsequent scans and can be
removed.
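
A minimal sketch of this idea, with assumed names, is shown below: once the frequent k-itemsets of a pass are known, transactions that contain none of them are dropped before the next scan.

def reduce_transactions(transactions, frequent_k_itemsets):
    """Keep only transactions containing at least one frequent k-itemset."""
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_k_itemsets)]

# Example: with frequent 2-itemsets {R, P} and {P, O}, the transaction {M, A}
# contains neither, so it can be skipped in later scans.
pruned = reduce_transactions(
    [{"R", "P", "O"}, {"M", "A"}],
    [frozenset({"R", "P"}), frozenset({"P", "O"})],
)
print(pruned)  # only the transaction {'R', 'P', 'O'} remains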

Apriori Algorithm in data mining

We have already discussed an example of the apriori algorithm related to the frequent itemset
generation. Apriori algorithm has many applications in data mining.

The primary approaches to finding the association rules in data mining are given below.

Use Brute Force

Analyze all the rules and find the support and confidence levels for each individual rule.
Afterward, eliminate the rules whose values fall below the threshold support and confidence
levels.

The two-step approach

The two-step approach is a better option for finding the association rules than the Brute Force
method.

Step 1

In this article, we have already discussed how to create the frequency table and calculate
itemsets having a greater support value than that of the threshold support.

Step 2

To create association rules, you need to use a binary partition of the frequent itemsets. You
need to choose the ones having the highest confidence levels.

In the above example, you can see that the RPO combination was the frequent itemset. Now,
we find out all the rules using RPO.

RP -> O, RO -> P, PO -> R, O -> RP, P -> RO, R -> PO

You can see that there are six different combinations. Therefore, if a frequent itemset has n
elements, there will be 2^n − 2 candidate association rules.
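
As a quick check of the 2^n − 2 count, the sketch below enumerates the binary partitions of the frequent itemset {R, P, O} and prints the six candidate rules listed above (variable names are illustrative):

from itertools import combinations

itemset = {"R", "P", "O"}
rules = []
for r in range(1, len(itemset)):  # antecedent sizes 1 .. n-1
    for antecedent in combinations(sorted(itemset), r):
        X = set(antecedent)
        Y = itemset - X
        rules.append(("".join(sorted(X)), "".join(sorted(Y))))

for X, Y in rules:
    print(f"{X} -> {Y}")
print(len(rules), "==", 2 ** len(itemset) - 2)  # 6 == 6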
Advantages of Apriori Algorithm

 It can be used to find large (frequent) itemsets.


 Simple to understand and apply.

Disadvantages of Apriori Algorithms

 The Apriori algorithm is an expensive method for finding support counts, since the
calculation has to pass through the whole database repeatedly.
 Sometimes a huge number of candidate itemsets and rules is generated, so it becomes
computationally more expensive.

Apriori vs. FP-Growth

 Apriori generates frequent patterns by forming candidate itemsets level by level (single
itemsets, then pairs, then triples, and so on). FP-Growth builds an FP-Tree and derives the
frequent patterns from it.
 Apriori uses candidate generation, where frequent subsets are extended one item at a time.
FP-Growth generates a conditional FP-Tree for every item in the data.
 Since Apriori scans the database at each step, it becomes time-consuming when the number
of items is large. FP-Growth requires only one database scan in its beginning steps, so it
consumes less time.
 With Apriori, a converted version of the database is saved in memory. With FP-Growth, a
set of conditional FP-Trees for every item is saved in memory.
 Apriori uses a breadth-first search, whereas FP-Growth uses a depth-first search.
